We recently updated to Lustre 2.8 on our cluster and have started seeing some unusual load issues. Last night our MDS load climbed to well over 100, and client performance dropped to almost zero. Initially this appeared to be related to a number of jobs that were doing large numbers of opens/closes, but even after killing those jobs, the MDS load did not recover.
The stats in /proc/fs/lustre/mdt/scratch-MDT0000/exports showed little to no activity on the MDS. iostat showed almost no disk activity to the MDT (or to any device, for that matter) and minimal I/O wait. The machine has 128GB of memory, and over half of it was free. I eventually ended up unmounting the MDT and failing it over to a backup MDS, which promptly recovered and now has a load of near zero.

Has anyone seen this before? Any suggestions for what I should look at if this happens again?

Thanks!
Kevin

--
Kevin Hildebrand
University of Maryland, College Park
Division of IT
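
For reference, here is a minimal sketch of a snapshot that could be taken the next time the load climbs, before failing over. It rests on one general Linux fact: the load average counts tasks in uninterruptible sleep (state D) as well as runnable tasks, so a high load with idle disks and free memory can come from blocked threads rather than from real work. The function names are hypothetical, and nothing here is Lustre-specific; it only reads /proc/loadavg and /proc/<pid>/stat and groups task names by state.

```python
import os


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    fields = text.split()
    return tuple(float(value) for value in fields[:3])


def parse_task_state(stat_text):
    """Return (comm, state) from the contents of a /proc/<pid>/stat file.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the last ')' rather than on spaces.
    """
    left, _, right = stat_text.rpartition(")")
    comm = left.split("(", 1)[1]
    state = right.split()[0]
    return comm, state


def group_by_state(stat_texts):
    """Map each task state letter (R, S, D, ...) to a list of task names."""
    groups = {}
    for text in stat_texts:
        comm, state = parse_task_state(text)
        groups.setdefault(state, []).append(comm)
    return groups


def snapshot(proc="/proc"):
    """Read the load averages and per-task states from a procfs mount."""
    with open(os.path.join(proc, "loadavg")) as handle:
        load = parse_loadavg(handle.read())
    stat_texts = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as handle:
                stat_texts.append(handle.read())
        except OSError:
            # The task exited between listdir() and open(); skip it.
            continue
    return load, group_by_state(stat_texts)


if __name__ == "__main__":
    try:
        load, groups = snapshot()
    except OSError as err:
        print("could not read procfs:", err)
    else:
        print("load averages:", load)
        for state in ("R", "D"):
            names = sorted(groups.get(state, []))
            print("state", state, "count", len(names), "tasks", names)
```

If the D count is close to the load figure, the names of those tasks show which threads are stuck, which is something iostat and the memory numbers would not reveal.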
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org