Hello Lustre community, I've searched through the documentation and various forums but haven't found a clear solution to this issue.
We have a Lustre setup with 10 OSS nodes, each hosting 3 OSTs. Occasionally, one of the OSTs becomes unresponsive, and we're forced to reboot the corresponding OSS to restore functionality. The logs show an error like:

bulk IO read error with 5c06fsdf-xxxxxxx, client will retry, rc=-110

This Lustre filesystem is primarily used for SLURM jobs running AI/ML workloads. I'm trying to identify which SLURM job or user is generating the high I/O load that could be causing these hangs, so that we can investigate or temporarily stop that user or job.

I've tried enabling job ID tracking with:

lctl set_param -P jobid_var=SLURM_JOB_ID

but it doesn't seem to be working as expected. Does anyone have a reliable method for identifying the SLURM users or jobs responsible for high I/O on Lustre, and for mitigating the hung OSTs?

Any insights or suggestions would be greatly appreciated. If further details are required, I am at your disposal.

Regards,
Ihsan Ur Rahman
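(For context on what I've attempted so far: once jobid_var is set on the clients, per-job I/O counters should show up in obdfilter.*.job_stats on each OSS via "lctl get_param obdfilter.*.job_stats". Below is a rough, untested sketch of how one might rank jobs by bytes written from that output; the SAMPLE text, the testfs filesystem name, and the job IDs are made up for illustration, and the parser just pattern-matches the usual YAML-ish job_stats layout.)

```python
import re
from collections import defaultdict

# Hypothetical sample of `lctl get_param obdfilter.*.job_stats` output;
# filesystem name, job IDs, and counter values are invented for illustration.
SAMPLE = """\
obdfilter.testfs-OST0000.job_stats=
job_stats:
- job_id:          1234
  snapshot_time:   1700000000
  read_bytes:      { samples: 10, unit: bytes, min: 4096, max: 1048576, sum: 10485760 }
  write_bytes:     { samples: 5, unit: bytes, min: 4096, max: 1048576, sum: 5242880 }
- job_id:          5678
  snapshot_time:   1700000100
  read_bytes:      { samples: 2, unit: bytes, min: 4096, max: 8192, sum: 12288 }
  write_bytes:     { samples: 200, unit: bytes, min: 1048576, max: 4194304, sum: 838860800 }
"""

def top_jobs(text, metric="write_bytes", n=5):
    """Aggregate the 'sum' field of the given metric per job_id across OSTs."""
    totals = defaultdict(int)
    job = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"- job_id:\s+(\S+)", line)
        if m:
            job = m.group(1)
            continue
        m = re.match(rf"{metric}:.*sum:\s+(\d+)", line)
        if m and job is not None:
            totals[job] += int(m.group(1))
    # Highest totals first, so the heaviest writer is at the top
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

if __name__ == "__main__":
    for job_id, total in top_jobs(SAMPLE):
        print(f"job {job_id}: {total} bytes written")
```

With jobid_var=SLURM_JOB_ID the job_id values should map directly onto SLURM job IDs, which can then be matched to users with squeue or sacct.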
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
