On Sep 28, 2009, at 12:46 AM, Aaron Knister wrote:
> I wanted to post this here so in the event that anybody else stumbles
> across this problem they don't spend hours banging their head against
> a brick wall. I was helping with a lustre disk setup that kept
> crashing. The lustre filesystem would hang and there would be one
> thread (ll_mdt_[0-9]*) that would be pegged at 100% of the cpu. It
> turns out there was some on disk inconsistencies as a result of the
> MDS crashing because it ran out of memory. A simple fsck of the MDT
> fixed the issue, after many hours of attempted debugging. We didn't
> think the problem could be fixed by a simple fsck...but it makes
> sense.

Recent kernels have additional checks (in do_split(), and in other places as well) to prevent this kind of problem (a crash or an infinite loop when the layout is corrupted). I wonder whether they would catch this case and return an error instead. Do you know where in do_split() the process was stuck?

Cheers,
Johann
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
