Hi,

On Sep 17, 2010, at 14:49, Bernd Schubert wrote:
> Hello Cory,
>
> On 09/17/2010 11:31 PM, Cory Spitz wrote:
>> Hi, Bernd.
>>
>> On 09/17/2010 02:48 PM, Bernd Schubert wrote:
>>> On Friday, September 17, 2010, Andreas Dilger wrote:
>>>> On 2010-09-17, at 12:42, Jonathan B. Horen wrote:
>>>>> We're trying to architect a Lustre setup for our group, and want to
>>>>> leverage our available resources. In doing so, we've come to consider
>>>>> multi-purposing several hosts, so that they'll function simultaneously
>>>>> as MDS & OSS.
>>>>
>>>> You can't do this and expect recovery to work in a robust manner. The
>>>> reason is that the MDS is a client of the OSS, and if they are both on the
>>>> same node that crashes, the OSS will wait for the MDS "client" to
>>>> reconnect and will time out recovery of the real clients.
>>>
>>> Well, that is some kind of design problem. Even on separate nodes it can
>>> easily happen, that both MDS and OSS fail, for example power outage of the
>>> storage rack. In my experience situations like that happen frequently...
>>>
>>
>> I think that just argues that the MDS should be on a separate UPS.
>
> well, there is not only a single reason. Next hardware issue is that
> maybe an IB switch fails. And then have also seen cascading Lustre
> failures. It starts with an LBUG on the OSS, which triggers another
> problem on the MDS...
> Also, for us this actually will become a real problem, which cannot be
> easily solved. So this issue will become a DDN priority.

There is always a possibility that multiple failures will occur, and
this possibility can be reduced depending on one's resources. The point
here is simply that a configuration with an MDS and OSS on the same node
will guarantee multiple failures and aborted OSS recovery when that node
fails.

cheers,
robert
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
