We have Lustre 1.6.5.1 on our cluster (CentOS 4.7). The /home filesystem holds 27 TB, distributed over three OSS servers, each containing a RAID split into two filesystems. It is 90% full and contains about 2 million files.
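(Since the filesystem is 90% full overall, usage may well be uneven across the individual OSTs; one full OST can cause write errors even while `df` shows free space. Assuming /home is the client mount point, a quick per-target check might look like this, run from any node with the filesystem mounted:)

```shell
lfs df -h /home        # space used per OST/MDT, not just the aggregate
lfs df -i /home        # inode usage per target
lfs check servers      # report the state of this client's MDS/OST connections
```

These commands only need to be run on a Lustre client; the exact output format varies between Lustre versions.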
It is not particularly stable: every one to three weeks the filesystem goes AWOL and I have to reboot the machine. This morning I ran "ls -lR" on the front-end (which also serves as the MDS), just to count the files, and it took more than an hour. During that time "top" showed "ls" using anywhere between 5% and 90% of a CPU, mostly in the 10-30% range. Is this normal?

The crash this weekend was preceded by half a day during which the front-end kept losing and regaining its connection to the filesystem. It would work for a while, then "df" would return "input/output error" or "Cannot send after transport endpoint", then recover again. The whole time, everything seemed fine from the compute nodes and from one external Lustre client (until the filesystem went away completely).

I have inherited this cluster and I am not an expert in filesystems. The timeout is set to its default of 100 s. How do I find out what's wrong?

Herbert
--
Herbert Fruchtl
Senior Scientific Computing Officer
School of Chemistry, School of Mathematics and Statistics
University of St Andrews
--
The University of St Andrews is a charity registered in Scotland: No SC013532
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
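(A side note on the slow "ls -lR": `ls -l` stat()s every entry, which on Lustre typically means an attribute RPC to the MDS, and size lookups on the OSTs, per file; across 2 million files that adds up. If only a count is needed, a name-only walk is far cheaper, since `find` can usually answer `-type f` from the directory entry itself. A minimal sketch, with `count_files` being a hypothetical helper name:)

```shell
# count_files DIR: count regular files under DIR with a name-only
# walk, avoiding the per-file stat() traffic that `ls -lR` generates.
count_files() {
    find "$1" -type f | wc -l
}

# Example usage:
#   count_files /home
```

This does not fix the underlying instability, but it avoids hammering the MDS just to get a file count.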
