On Jan 20, 2009 11:11 +0000, Herbert Fruchtl wrote:
> It is not particularly stable. Every 1-3 weeks the filesystem goes awol
> and I have to reboot the machine. This morning I did an "ls -lR" on the
> front-end (which serves as MDS), just to count the files, and it took
> more than one hour. "top" showed "ls" taking up anything between 5% and
> 90% of a CPU during this time (most of the time in the 10-30% range).
> Is this normal?
>
> The crash this weekend was preceded by half a day during which the
> front-end kept losing and regaining connection to the filesystem. It
> worked for a while, then "df" gave an "input/output error", or "Cannot
> send after transport endpoint", then recovered again. It seemed OK all
> the time from the compute nodes, and from one external Lustre client
> (until it went away completely).
>
> I have inherited this cluster and I am not an expert in filesystems.
> The timeout is set to its default 100s. How do I find out what's wrong?
There was a series of bugs related to "statahead" in the 1.6.5.1 release
that could cause problems with "ls -lR" type workloads. You can disable
this feature with "echo 0 > /proc/fs/lustre/llite/*/statahead_max", or
upgrade at least the clients to the 1.6.6 release.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
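
P.S. In case it is useful, a minimal sketch of the workaround on the
client side. The pdsh fan-out and the node[01-16] hostlist are only
examples for your site, and the /proc setting is per-mount, so it must
be reapplied after a remount or reboot until you can upgrade:

    # check the current value on a client (0 means statahead is disabled)
    cat /proc/fs/lustre/llite/*/statahead_max

    # disable statahead for every Lustre mount on this client
    for f in /proc/fs/lustre/llite/*/statahead_max; do
        echo 0 > $f
    done

    # push the same change to all clients at once, e.g. with pdsh
    # (hostlist is illustrative)
    pdsh -w node[01-16] \
        'for f in /proc/fs/lustre/llite/*/statahead_max; do echo 0 > $f; done'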
