Steve, what happens when you run it with strace?

I am running the same version of nvidia-smi as you, and I've noticed that it
now allocates a TON of memory for no (apparent) reason.  That started
happening sometime in the past few months.  I think it may be related to the
symptoms you describe, but I am not completely sure.

Here's the relevant snippet from the output of `strace nvidia-smi`:

2285205 stat("/var/run/nvidia-persistenced/socket", {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
2285205 socket(AF_UNIX, SOCK_STREAM, 0) = 9
2285205 connect(9, {sa_family=AF_UNIX, sun_path="/var/run/nvidia-persistenced/socket"}, 37) = 0
2285205 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
2285205 prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1073741816}) = 0
2285205 mmap(NULL, 4294967296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5709400000
2285205 mmap(NULL, 51539607552, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4b09400000
2285205 getpeername(9, {sa_family=AF_UNIX, sun_path="/var/run/nvidia-persistenced/socket"}, [128 => 38]) = 0

Quick summary: it connects to the persistenced socket, queries the file
descriptor limit (hard limit of about 1 billion), maps 4 GiB of memory, maps
another 48 GiB of memory (!!!), and then proceeds to use the persistenced
socket.
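
For anyone who wants to check the arithmetic, here is a small Python sketch
that converts the two mmap sizes from the trace into GiB (51539607552 bytes is
about 51.5 GB in decimal units, or exactly 48 GiB) and compares each one to the
hard descriptor limit.  The "bytes per possible descriptor" figure is only an
observation about the numbers; the trace does not show that nvidia-smi really
sizes its allocations this way, so treat that part as an assumption.

```python
GIB = 1024 ** 3

# Values copied from the strace output above.
mmap_sizes = [4294967296, 51539607552]
rlim_max = 1073741816  # hard RLIMIT_NOFILE reported by prlimit64


def to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / GIB


for size in mmap_sizes:
    # The ratio is an observation only, not a confirmed cause.
    print(f"{size} bytes = {to_gib(size):.1f} GiB, "
          f"about {size / rlim_max:.1f} bytes per possible descriptor")

print(f"total = {to_gib(sum(mmap_sizes)):.1f} GiB")
```

The total comes out to 52.0 GiB, which lines up with the VIRT figure that `top`
reports for the process.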

This happens after reporting the card and driver versions, and before listing
the processes.  Whatever it's doing during this time, it also delays
execution for ~25 seconds.  The `top` command says that the nvidia-smi
process has 52.0g virtual and 49.6g resident.

    PID  VIRT    RES S  %CPU  %MEM     TIME+ COMMAND
2285205 52.0g  49.6g R 100.0  39.4   0:22.14 nvidia-smi

If you have less memory than it's asking for, that might be a reason for your 
machine to go into swap hell and eventually freeze.
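
If you want to compare your own machine against those numbers, something like
the following Python sketch might help.  It prints the descriptor limits the
script itself runs under and compares MemAvailable from /proc/meminfo with the
two mmap sizes from my trace.  The helper name `mem_available_bytes` and the
`REQUESTED` constant are just mine for illustration, and the sizes come from my
machine, so yours may well differ.

```python
import resource

GIB = 1024 ** 3
# Sum of the two mmap sizes from the trace (an assumption for other machines).
REQUESTED = 4294967296 + 51539607552


def mem_available_bytes(meminfo_text):
    """Return MemAvailable in bytes from /proc/meminfo-style text, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) * 1024  # the field is reported in kB
    return None


def report(meminfo_path="/proc/meminfo"):
    """Print the descriptor limits and compare available memory to REQUESTED."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"RLIMIT_NOFILE: soft={soft} hard={hard}")
    try:
        with open(meminfo_path) as f:
            available = mem_available_bytes(f.read())
    except FileNotFoundError:
        available = None  # not a Linux system, or /proc is not mounted
    if available is None:
        print("MemAvailable not found")
        return
    print(f"MemAvailable: {available / GIB:.1f} GiB, "
          f"trace requested {REQUESTED / GIB:.1f} GiB")
    if available < REQUESTED:
        print("Less memory available than the trace requested; expect swapping.")


if __name__ == "__main__":
    report()
```

Run it from the same shell you would run nvidia-smi from, since the limits are
inherited from the parent process.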

I'm seeing this on Debian Trixie.  Package versions:

||/ Name                     Version      Architecture
+++-========================-============-============
ii  libcuda1:amd64           535.183.01-1 amd64
ii  libnvidia-ml1:amd64      535.183.01-1 amd64
ii  linux-image-6.8.12-amd64 6.8.12-1     amd64
ii  nvidia-kernel-dkms       535.183.01-1 amd64
ii  nvidia-persistenced      535.171.04-1 amd64
ii  nvidia-smi               535.183.01-1 amd64

Thanks,
Mark
