On Jul 31, 2008 20:45 +0200, Thomas Roth wrote:
> I'm still successful in bringing my OSSs to a standstill, if not crashing
> them.
> Having reduced the number of stress jobs writing to Lustre (stress -d 2
> --hdd-noclean --hdd-bytes 5M) to four, and having reduced the number of
> OSS threads (options ost oss_num_threads=256 in /etc/modprobe.d/lustre),
> the OSSes no longer freeze entirely. Instead, after ~15 hours:
> - all stress jobs have terminated with Input/output error
> - the MDT has marked the affected OSTs as Inactive
> - the already-open connections to the OSS remain active
> - interactive collectl, "watch df", and top sessions are still working
> - the number of ll_ost threads is 256 (the number of ll_ost_io is 257?)
> - log file writing has evidently stopped after only 10 hours
> - already-open shells allow commands like "ps", and I can kill some processes
> - new ssh logins don't work
> - disk access, as in "ls", brings the system to a total freeze
>
> The process table shows six ll_ost_io threads, all using 38.9% CPU,
> all running for 419:21m. All the rest are sleeping.
> The cause can't be system overload or simple faulty hardware.
You need to look at the process table (sysrq-t) and get the stacks of the running and blocked Lustre processes. Also useful would be the memory information (sysrq-m), to see whether the node is out of free memory and, if so, where it has gone. If you can still run some commands, then "cat /proc/slabinfo" may also be useful.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
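[Editor's note: a minimal sketch of summarizing the /proc/slabinfo output mentioned above, to spot which slab caches are eating memory. It assumes the slabinfo 2.x column layout (name, active_objs, num_objs, objsize, ...); the sample data below is made up for illustration, so the pipeline can run anywhere — on a live OSS you would point it at /proc/slabinfo directly.]

```shell
# Print the top slab caches by total memory (num_objs * objsize),
# assuming slabinfo version 2.x field order.
summarize() {
  # Skip the two header lines, then compute per-cache memory in KB.
  awk 'NR > 2 { printf "%-20s %10d KB\n", $1, $3 * $4 / 1024 }' "$1" |
    sort -rn -k2 | head -5
}

# Fabricated sample standing in for /proc/slabinfo (illustration only).
cat > /tmp/slabinfo.sample <<'EOF'
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize>
size-4096               12000      12500      4096
ldlm_locks              80000      81000       512
inode_cache             40000      42000       600
EOF

summarize /tmp/slabinfo.sample
```

On a real node, `summarize /proc/slabinfo` (as root) gives the same ranking; the sysrq-t and sysrq-m dumps land in the kernel log and can be captured with dmesg.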
