On May 27, 2025, at 14:12, Ihsan Ur Rahman via lustre-discuss 
<[email protected]> wrote:


Hello Lustre community,

I've searched through the documentation and various forums but haven’t found a 
clear solution for this issue.

We have a Lustre setup with 10 OSS nodes, each hosting 3 OSTs. Occasionally, 
one of the OSTs becomes unresponsive, and we’re forced to reboot the 
corresponding OSS to restore functionality. The logs show an error like:

bulk IO read error with 5c06fsdf-xxxxxxx, client will retry, rc=-110


This Lustre filesystem is primarily used for SLURM jobs running AI/ML workloads.

I’m trying to identify which SLURM job or user is initiating high I/O 
operations that could be causing these hangs, so that we can investigate or 
temporarily stop that user/job. I’ve tried setting the job ID tracking with:

lctl set_param -P jobid_var=SLURM_JOB_ID


But it doesn't seem to be working as expected.

Can you provide some details of what isn't working?  Does the "jobid_name" 
variable contain "%j" to include the jobid from SLURM_JOB_ID?
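For reference, a minimal sketch of checking and setting the jobid tunables on a client. The "%e", "%j", and "%u" escapes in jobid_name are the standard Lustre format codes (executable name, job ID, UID); exact behavior can vary by Lustre release, so verify against your version's manual.

```shell
# Check the current jobid settings on a client
lctl get_param jobid_var jobid_name

# jobid_var selects the environment variable the job ID is read from;
# jobid_name is the template used to build the reported jobid string.
# "%j" expands to the job ID, "%e" to the executable name, "%u" to the UID.
lctl set_param -P jobid_var=SLURM_JOB_ID
lctl set_param -P jobid_name="%e.%j"

# Verify that new I/O is being tagged: run a job, then look for its
# job ID in the per-OST job statistics on an OSS
lctl get_param obdfilter.*.job_stats
```

If jobid_name lacks "%j", the SLURM job ID will never show up in job_stats even with jobid_var set correctly, which would match the symptom described.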

Does anyone have a reliable method for identifying the SLURM users or jobs 
responsible for high I/O on Lustre? And how can I mitigate the hung OSTs?


Any insights or suggestions would be greatly appreciated. If further details 
are required, I am at your disposal.

If you have JobStats working, you could try using the "lljobstat" tool to 
monitor the jobs sending the most RPCs to a single server node.
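A sketch of that workflow, assuming JobStats is enabled. The lljobstat flags shown here may differ between Lustre releases (check lljobstat --help), and the job ID 12345 is just a placeholder for whatever shows up at the top of the list.

```shell
# Show the top 10 jobs by RPC count, refreshing every 10 seconds
# (flag names may vary by release; see lljobstat --help)
lljobstat -c 10 -i 10

# The raw per-job counters that lljobstat aggregates are also
# available directly:
lctl get_param obdfilter.*.job_stats   # on an OSS
lctl get_param mdt.*.job_stats         # on an MDS

# Once you have a suspect job ID (12345 is a placeholder), map it
# back to the SLURM user and nodes:
squeue -j 12345 -o "%u %N %j"
sacct -j 12345 --format=User,JobName,NodeList,State
```

From there you can contact the user, or hold/cancel the job with scontrol/scancel while you investigate the OST hangs.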

Cheers, Andreas
—
Andreas Dilger
Lustre Principal Architect
Whamcloud/DDN




_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
