On Friday, September 17, 2010, Andreas Dilger wrote:
> On 2010-09-17, at 12:42, Jonathan B. Horen wrote:
> > We're trying to architect a Lustre setup for our group, and want to
> > leverage our available resources. In doing so, we've come to consider
> > multi-purposing several hosts, so that they'll function simultaneously
> > as MDS & OSS.
>
> You can't do this and expect recovery to work in a robust manner. The
> reason is that the MDS is a client of the OSS, and if they are both on the
> same node that crashes, the OSS will wait for the MDS "client" to
> reconnect and will time out recovery of the real clients.

Well, that is a design problem. Even on separate nodes it can easily
happen that both the MDS and the OSS fail, for example during a power
outage of the storage rack. In my experience, situations like that happen
frequently...

I think some kind of pre-connection would be required, where a client can
tell a server that it was rebooted and that the server should not wait any
longer for it. Actually, that shouldn't be too difficult, as different
connection flags already exist. So if the client contacts a server and
asks for an initial connection, the server could check for that NID and
then immediately abort recovery for that client.

Cheers,
Bernd

-- 
Bernd Schubert
DataDirect Networks

_______________________________________________
Lustre-discuss mailing list
http://lists.lustre.org/mailman/listinfo/lustre-discuss
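
The pre-connection idea proposed above can be sketched as a small model of the server's recovery bookkeeping. This is illustrative Python only, not Lustre code: the `RecoveryWindow` class, the `handle_connect` method, the `initial` flag, and the example NIDs are all hypothetical names chosen for the sketch, and it assumes the server tracks the NIDs it is still waiting for during recovery.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryWindow:
    """Hypothetical model of a server waiting for known clients to reconnect."""

    # NIDs of clients the server still expects to reconnect and replay.
    pending: set = field(default_factory=set)

    def handle_connect(self, nid: str, initial: bool) -> str:
        """Process a connect request during recovery.

        `initial` models a connection flag meaning "I was rebooted and
        have no state to replay" (an assumption, not a real Lustre flag).
        """
        if nid not in self.pending:
            # Unknown client: nothing to recover for it.
            return "new"
        self.pending.discard(nid)
        if initial:
            # The client rebooted, so stop waiting for it right away
            # instead of letting the recovery timer run out.
            return "recovery-aborted"
        return "recovered"

    @property
    def complete(self) -> bool:
        """Recovery finishes once no expected client is outstanding."""
        return not self.pending
```

In this model, an MDS that crashed together with the OSS would reconnect with `initial=True`, so the OSS drops it from the pending set immediately and the remaining clients are no longer held up by it.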
