On Thu, Jul 01, 2010 at 11:17:31AM -0600, Kevin Van Maren wrote:
>My (personal) opinion:
>
>Lustre clients should always start (mount) automatically.

yup

>Lustre servers should have their services started through heartbeat (or
>other HA package), if failover is possible (be sure to configure stonith).

IMHO that's a bad idea. servers should not start automatically.

my objections to automated mount/failover are not Lustre related, but
apply to all the layers underneath - as Kevin well knows, mptsas drivers
can and do and have screwed up majorly, and I'm sure other drivers have
too. md is far from smart, and disks break in such an infinite number of
weird and wonderful ways that no driver or OS can reasonably be expected
to deal with them all :-/

if you have the simple setup of singly-attached storage and a Lustre
server has just crashed, then why wouldn't it just crash again? we have
had that happen. automated startup seems silly in this case - especially
if you don't know what the problem was to start with. the worst case is
hardware that started corrupting data and then crashed the machine: is it
really a good idea to reboot, remount, continue corrupting more data, and
then keep rebooting until dawn?

if you have a more elaborate Lustre setup with HA failover pairs then the
above still applies, and additionally there are inherent races when both
nodes in a pair try to mount the same set of disks, unless a third
impartial member participates in a failover quorum - not a common HA
setup for Lustre, although it probably should be. if a sw raid is
assembled on both machines at the same time because of an HA race, then
it's likely data will be lost. Lustre mmp should save you from
multi-mounting the OST, but obviously not from corruption if the
underlying raid is already trashed.

overall, without diagnosing why a machine crashed, I fail to see how an
automated reboot or failover can possibly be a safe course of action.

cheers,
robin

>If heartbeat starts automatically, do ensure auto-failback is NOT
>enabled: fail the resources back manually after you verify the rebooted
>server is healthy.
>Whether heartbeat starts automatically seems to be a preference issue.
>
>While unlikely, it is possible for an issue to cause Lustre to not start
>successfully, resulting in a node crash or other issue preventing a
>login. So if it does start automatically you'll want to be prepared to
>reboot w/o Lustre (eg, single-user mode).
>
>Kevin
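
to make the quorum point above concrete, here is a minimal sketch of a
majority rule. it is only an illustration: the node names (oss1, oss2,
arbiter) and both functions are hypothetical, and this is not how
heartbeat or any particular HA package implements membership. it just
shows why a two-node pair cannot safely break a tie on its own, while a
third impartial member lets exactly one side win.

```python
def has_quorum(reachable_members: int, cluster_size: int) -> bool:
    """True only if a partition holds a strict majority of the cluster."""
    return reachable_members > cluster_size // 2


def storage_nodes_allowed_to_mount(partitions, storage_nodes, cluster_size):
    """Return the storage nodes that sit in a partition holding quorum.

    partitions:    list of sets; each set is a group of members that can
                   still see each other after a network split.
    storage_nodes: set of members that are actually attached to the disks.
    """
    allowed = set()
    for group in partitions:
        if has_quorum(len(group), cluster_size):
            allowed |= group & storage_nodes
    return allowed


servers = {"oss1", "oss2"}  # hypothetical failover pair

# Two-node pair, link between the nodes lost: each side sees 1 of 2
# members, so neither holds a majority and a quorum rule blocks both.
# (A naive "peer looks dead, take over" rule would let both proceed.)
two_node = storage_nodes_allowed_to_mount(
    [{"oss1"}, {"oss2"}], servers, cluster_size=2)

# Same split, but a third impartial member can still see oss1:
# that side holds 2 of 3 votes, the other side holds 1 of 3.
three_node = storage_nodes_allowed_to_mount(
    [{"oss1", "arbiter"}, {"oss2"}], servers, cluster_size=3)

print(sorted(two_node))
print(sorted(three_node))
```

note that with only two members the majority rule stops both nodes, which
is safe but not available; the third member is what lets one side carry
on. none of this helps if the raid underneath is already damaged.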
