Hello Lustre community, I've searched through the documentation and various forums but haven't found a clear solution to this issue.
We have a Lustre setup with 10 OSS nodes, each hosting 3 OSTs. Occasionally, one of the OSTs becomes unresponsive, and we're forced to reboot the corresponding OSS to restore functionality. The logs show an error like:

bulk IO read error with 5c06fsdf-xxxxxxx, client will retry, rc=-110

This Lustre filesystem is primarily used for SLURM jobs running AI/ML workloads. I'm trying to identify which SLURM job or user is generating the high I/O load that could be causing these hangs, so that we can investigate or temporarily stop that user or job.

I've tried enabling job ID tracking with:

lctl set_param -P jobid_var=SLURM_JOB_ID

but it doesn't seem to be working as expected. Does anyone have a reliable method for identifying the SLURM users or jobs responsible for high I/O on Lustre, and for mitigating the hung OSTs?

Any insights or suggestions would be greatly appreciated. If further details are required, I am at your disposal.

Regards,
Ihsan Ur Rahman
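(For context on what I've attempted so far: once jobid_var is set on the clients, per-job I/O counters should show up in obdfilter.*.job_stats on each OSS via "lctl get_param obdfilter.*.job_stats". Below is a rough, untested sketch of how one might rank jobs by bytes written from that output; the SAMPLE text, the testfs filesystem name, and the job IDs are made up for illustration, and the parser just pattern-matches the usual YAML-ish job_stats layout.)

```python
import re
from collections import defaultdict

# Hypothetical sample of `lctl get_param obdfilter.*.job_stats` output;
# filesystem name, job IDs, and counter values are invented for illustration.
SAMPLE = """\
obdfilter.testfs-OST0000.job_stats=
job_stats:
- job_id:          1234
  snapshot_time:   1700000000
  read_bytes:      { samples: 10, unit: bytes, min: 4096, max: 1048576, sum: 10485760 }
  write_bytes:     { samples: 5, unit: bytes, min: 4096, max: 1048576, sum: 5242880 }
- job_id:          5678
  snapshot_time:   1700000100
  read_bytes:      { samples: 2, unit: bytes, min: 4096, max: 8192, sum: 12288 }
  write_bytes:     { samples: 200, unit: bytes, min: 1048576, max: 4194304, sum: 838860800 }
"""

def top_jobs(text, metric="write_bytes", n=5):
    """Aggregate the 'sum' field of the given metric per job_id across OSTs."""
    totals = defaultdict(int)
    job = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"- job_id:\s+(\S+)", line)
        if m:
            job = m.group(1)
            continue
        m = re.match(rf"{metric}:.*sum:\s+(\d+)", line)
        if m and job is not None:
            totals[job] += int(m.group(1))
    # Highest totals first, so the heaviest writer is at the top
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

if __name__ == "__main__":
    for job_id, total in top_jobs(SAMPLE):
        print(f"job {job_id}: {total} bytes written")
```

With jobid_var=SLURM_JOB_ID the job_id values should map directly onto SLURM job IDs, which can then be matched to users with squeue or sacct.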
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
