On May 4, 2011, at 10:05 AM, Charles Taylor wrote:

> We are dipping our toes into the waters of Lustre HA using
> pacemaker. We have 16 7.2 TB OSTs across 4 OSSs (4 OSTs each).
> The four OSSs are broken out into two dual-active pairs running Lustre
> 1.8.5. Mostly, the water is fine but we've encountered a few
> surprises.
>
> 1. An 8-client iozone write test in which we write 64 files of 1.7
> TB each seems to go well - until the end, at which point iozone seems
> to finish successfully and begins its "cleanup". That is to say, it
> starts to remove all 64 large files. At this point, the ll_ost
> threads go bananas - consuming all available CPU cycles on all 8 cores
> of each server. This seems to block the corosync "totem" exchange
> long enough to initiate a "stonith" request.
Running oprofile or profile.pl (possibly only included in SGI's respin of PerfSuite; the original is at http://perfsuite.ncsa.illinois.edu/) is useful in situations where you have one or more threads consuming a lot of CPU. It should point to the function(s) in which the offending thread(s) are spending their time. From there, bugzilla/jira or the mailing list should be able to help further.

> 2. We have found that re-mounting the OSTs, either via the HA agent or
> manually, often can take a *very* long time - on the order of four or
> five minutes. We have not figured out why yet. An strace of the
> mount process has not yielded much. The mount seems to just be
> waiting for something but we can't tell what.

Could be bz 18456.

Jason

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
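[Editor's note: one possible stopgap for the stonith-during-cleanup behavior described in the quoted problem 1, not something suggested in the thread itself, is raising corosync's totem token timeout so that a brief CPU stall on an OSS is not mistaken for a dead node. A minimal corosync.conf fragment, with values that are purely illustrative and must be tuned for the cluster at hand:]

```
# /etc/corosync/corosync.conf (fragment) - illustrative values only
totem {
    version: 2
    # Milliseconds without receiving the token before corosync declares
    # token loss; raising this gives busy ll_ost threads some headroom
    # before membership changes (and thus stonith) are triggered.
    token: 10000
    # How many token retransmit attempts are made before the token is
    # finally considered lost.
    token_retransmits_before_loss_const: 10
}
```

[Note that a longer token timeout also delays legitimate failure detection, so it trades faster failover for fewer false-positive fencing events.]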
