Hi Lisa;

We don't start the services automatically on our servers. We don't have so many Lustre servers that this is a big problem (17 total), and it is pretty rare for one of them to go down unexpectedly.
If one of our Lustre server nodes does go down unexpectedly, we fsck the associated OSTs/MDT before starting up Lustre services again. I think you will want to do the same.

We do the fsck from the command line and look at the output. If there were no filesystem modifications (this is the usual case), we then start the Lustre services interactively. If there were modifications from fsck, we'll generally fsck it again and verify there were no further modifications. If 'fsck -f -p' fails, we'll fsck interactively or just go whole hog and 'fsck -f -y'.

I imagine you could achieve an "automated startup following failure" at least most of the time with an init script that does an 'fsck -f -p' on the associated OSTs/MDT if the node is coming back up from a crash or power outage. If there aren't any modifications made by fsck, your init script could mount the storage. If 'fsck -f -p' bails out, you might send out an "I need help" email or something.

Cheers,
Craig Prescott
UF HPC Center

We once ran a cluster with lustre
We bought from a guy named Buster
It ran for a year with nary a tear
A complaint we could not muster

Lisa Giacchetti wrote:
> Hello,
> I have recently installed a lustre cluster which is in a test phase now
> but will potentially be in 24x7 production if its accepted.
> I would like input from the list on what the recommendations/best
> practices are for configuration of a lustre cluster startup.
> Is it advisable to have lustre on the various server pieces
> (mgs/mdt/oss's) start automatically? If not why not?
> If you try to start it and there is a very serious problem will it
> abort the startup or just continue on blindly?
>
> Again this is going to need to be a 24x7 service for a compute facility
> that which has global access (ie someone is always
> up and running something). We'd like to be able to at least get the
> service back up in an automated way if at all possible and then debug
> problems when the support staff are awake/available.
>
> Lisa Giacchetti
>
> _______________________________________________
> Lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
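For anyone wanting to try the automated variant suggested in the reply, the decision logic might be sketched roughly as follows. This is only an illustration, not something we run: it assumes the exit-status convention documented in fsck(8) (0 = no errors, 1 = errors corrected, anything else = trouble), and it assumes that status 0 can stand in for "no modifications". The names decide, recover, and run_fsck, and the two-pass limit, are made up for this example. Mounting the storage and sending the "I need help" email are left to the caller.

```python
import subprocess


def run_fsck(device):
    """Run the non-interactive check from the reply and return its exit status."""
    return subprocess.run(["fsck", "-f", "-p", device]).returncode


def decide(exit_status):
    """Map an fsck exit status to 'mount', 'recheck', or 'alert'.

    Assumption: per fsck(8), 0 means no errors and 1 means errors were
    corrected. Every other status is treated as "needs a human".
    """
    if exit_status == 0:
        return "mount"
    if exit_status == 1:
        return "recheck"
    return "alert"


def recover(device, check=run_fsck, max_passes=2):
    """Check a device, rechecking once if the first pass made changes.

    Returns 'mount' only if some pass finishes with no modifications;
    otherwise returns 'alert' so that someone can look at it by hand.
    """
    for _ in range(max_passes):
        action = decide(check(device))
        if action != "recheck":
            return action
    # Still being modified after max_passes: stop and ask for help.
    return "alert"
```

An init script would call recover() on each OST/MDT device only when the node is coming back from a crash or power outage, mount on 'mount', and send mail on 'alert'. The check argument is there so the logic can be exercised without touching a real device.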
