On Jun 13, 2008  16:03 -0400, Charles Taylor wrote:
> We have been running the config below on three different lustre file
> systems since early January and, for the most part, things have been
> pretty stable.  We are now experiencing frequent hangs on some
> clients - particularly our interactive login nodes.  All processes
> get blocked behind Lustre I/O requests.  When this happens there are
> *no* messages in either dmesg or syslog on the clients.  They seem
> unaware of a problem.

This is likely due to "client statahead" problems.  Please disable this
with "echo 0 > /proc/fs/lustre/llite/*/statahead_max" on the clients.
This should also be fixed in 1.6.5.

> 1. A ton of lustre-log.M.N files get dumped into /tmp in a short
> period of time.  Most of them appear to be full of garbage and
> unprintable characters rather than thread stack traces.  Many of them
> are also zero length.

The lustre-log files are not stack traces.  They are dumped lustre
debug logs.

> We have been adjusting lru_size on the clients but so far it has made
> no difference.  We have "options mds mds_num_threads=512" and our
> system timeout is 1000 (sure, go ahead and flame me but if we don't do
> that we get tons of "endpoint transport failures" on the clients and
> no, there are no connectivity issues).  :)
>
> We are open to suggestion and wondering if we should update the MDSs
> to 1.6.5.  Can we do that safely without also upgrading the clients
> and OSTs?

In general the MDS and OSS nodes should run the same level of software,
as that is what we test, but there isn't a hard requirement for it.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

_______________________________________________
Lustre-discuss mailing list
http://lists.lustre.org/mailman/listinfo/lustre-discuss
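
A note on the statahead workaround in the reply: on a client with more
than one Lustre mount, the glob in the one-line echo expands to several
paths, and bash rejects a redirect target like that as ambiguous.  A
minimal sketch of a per-mount loop follows.  The function name
disable_statahead and its optional directory argument are made up for
illustration and are not part of Lustre; only the /proc path and the
value 0 come from the reply.

```shell
#!/bin/sh
# Hypothetical helper (not a Lustre tool): write 0 to every
# statahead_max file found under the llite directory.
# $1 is an assumed optional override of that directory, handy for
# trying the loop against a scratch tree first.
disable_statahead() {
    llite_dir="${1:-/proc/fs/lustre/llite}"
    count=0
    for f in "$llite_dir"/*/statahead_max; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$f" ] || continue
        echo 0 > "$f" || return 1
        count=$((count + 1))
    done
    echo "statahead disabled on $count mount(s)"
}
```

Run as root on each client, calling "disable_statahead" with no
argument uses the path from the reply.  A setting written under /proc
would presumably need to be reapplied after a remount or reboot.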
