Something appeared to be messed up. We rebuilt the filesystem and now we can't reproduce the problem. Thanks for looking into it.

I am doing some failover testing right now; see my other emails. Now that I have the MGS seen as two hosts, failover is quite snappy for a known failover, i.e. a reboot on the active MDS, where heartbeat does what it should. Recovery from yanking power (ipmitool chassis power reset) takes a little longer but is still quite fast.

I am much happier with Lustre failover than I was a few days ago. My own personal growing pains. Thanks again for looking into this.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985

On Aug 18, 2008, at 11:02 PM, Andreas Dilger wrote:

> On Aug 07, 2008 12:06 -0400, Brock Palen wrote:
>> When the MDS came up on the new server by heartbeat it went into
>> recovery as expected. The MDS now has been in recovery for 1.5
>> hours. I don't think this is normal.
>>
>> What would cause this? I know by having a client go down (the reset
>> above) while the MDS is down but before recovery will cause recovery
>> to time out but 1.5 hours is unacceptable time to wait for the file
>> system to come back.
>
> The recovery should time out in about 5 minutes if the clients do not
> reply. Something is definitely wrong.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
