On Thu, 2024-07-11 at 12:23 -0400, Michael DiDomenico via lustre-discuss wrote:
> i have a strange problem, but honestly i'm not sure it's a lustre
> issue. but i figure i'll try here. we have users running LLM models
> through pytorch. part of the process saves off checkpoints at
> periodic intervals. when the checkpoint files are being written we
> can see in the logs pytorch writing out the save files from each
> of the processes.
>
> it chugs along for a little bit, but then comes to a grinding halt.
> no error from pytorch is logged and no errors can be found on the
> lustre clients or servers. the problem is also not transient, it
> happens every time the process runs

Does it ever resume, or does it stop-stop? If there is a hard stop after which the job is killed, how long is it?

Are the writes synchronous?

Can you collect Lustre debug logs from one of the clients with the +vfstrace +cache +rpctrace +inode debug mask, ideally while the hang is happening?

How many files are there? I assume there's only a limited number of processes per node?

Were obvious things like "a bunch of nodes writing into the same file in O_APPEND mode" already eliminated? (Or not in O_APPEND, but doing truncates in between.)

Also, what version are you running?

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
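
For the debug log request in the reply above, here is a rough sketch of how the collection could be scripted on one client. It is not part of the original message. It assumes root access and that `lctl` is on the PATH; the helper names, the dump path, and the buffer size are hypothetical. Each flag is enabled with its own `lctl set_param debug=+<flag>` call, the buffer is cleared with `lctl clear`, and the log is dumped with `lctl dk <file>`.

```python
import subprocess

# Debug flags named in the reply above.
DEBUG_FLAGS = ["vfstrace", "cache", "rpctrace", "inode"]


def build_debug_commands(flags, dump_path, buffer_mb=None):
    """Build the lctl command lines; nothing is executed here.

    Returns (setup, dump): a list of setup commands to run before
    reproducing the hang, and one command that dumps the debug log.
    """
    setup = []
    if buffer_mb is not None:
        # Optional: enlarge the kernel debug buffer (size is an assumption).
        setup.append(["lctl", "set_param", f"debug_mb={buffer_mb}"])
    for flag in flags:
        # One call per flag adds it to the current debug mask.
        setup.append(["lctl", "set_param", f"debug=+{flag}"])
    # Drop older messages so the dump covers only the reproduction.
    setup.append(["lctl", "clear"])
    dump = ["lctl", "dk", dump_path]
    return setup, dump


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


def collect(dump_path="/tmp/lustre-hang.log", buffer_mb=None):
    """Enable the flags, wait for the hang, then dump the log (needs root)."""
    setup, dump = build_debug_commands(DEBUG_FLAGS, dump_path, buffer_mb)
    run_commands(setup)
    input("Reproduce the hang, then press Enter to dump the debug log... ")
    run_commands([dump])
```

Calling `collect()` on a client while the checkpoint job runs there would leave the log at the hypothetical path `/tmp/lustre-hang.log`. The debug buffer is a ring, so dumping soon after the writes stall gives the best chance that the relevant messages are still in it.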
