Steve is right.
When I run `ulimit -Hn 1048576`, the allocations reported by `strace` shrink dramatically, and the `nvidia-smi` command completes immediately.

    2301004 prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
    2301004 mmap(NULL, 4198400, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4702800000
    2301004 mmap(NULL, 50335744, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f46ff600000

The 4GB and 51GB allocations became roughly 4MB and 50MB. So the memory usage (and initialization overhead?) scales with the hard file descriptor limit (`rlim_max`).
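
As a quick sanity check on the scaling, here is a small Python sketch that divides the two `mmap` sizes from the trace by the hard limit. It assumes each mapping is a per-descriptor table plus one extra 4096-byte page; that layout is my guess from the numbers, not something the trace or NVIDIA confirms. The helper names `bytes_per_descriptor` and `estimate_mmap_size` are made up for this example.

```python
import resource

# Sizes of the two anonymous mmap calls in the strace output, in bytes.
MMAP_SIZES = [4198400, 50335744]
# Hard RLIMIT_NOFILE shown by prlimit64 in the same trace (1024*1024).
HARD_LIMIT = 1024 * 1024
# Assumption: one extra page of overhead per mapping.
PAGE_SIZE = 4096


def bytes_per_descriptor(size, limit=HARD_LIMIT, page=PAGE_SIZE):
    """Return (bytes per descriptor, leftover bytes) for one mapping."""
    return divmod(size - page, limit)


def estimate_mmap_size(per_fd, limit, page=PAGE_SIZE):
    """Estimate a mapping size for a given hard limit under the same assumption."""
    return per_fd * limit + page


if __name__ == "__main__":
    for size in MMAP_SIZES:
        per_fd, leftover = bytes_per_descriptor(size)
        print(f"{size} bytes -> {per_fd} bytes per descriptor, {leftover} left over")

    # Apply the same estimate to the hard limit of the current shell.
    _soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        for size in MMAP_SIZES:
            per_fd, _ = bytes_per_descriptor(size)
            print(f"hard limit {hard}: about {estimate_mmap_size(per_fd, hard)} bytes")
```

Both sizes divide evenly: 4 and 48 bytes per descriptor with nothing left over, which fits the idea that the allocations are sized directly from the hard limit.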
Thanks, Mark

