On Jul 31, 2008  10:30 -0400, Brock Palen wrote:
> One of our OSS's died with a panic last night.  Between when it was
> found (no failover) and restarted two clients had died also.  (nodes
> crashed by user OOM).
>
> Because of this the OST's now are looking for 626 clients to recover
> when only 624 are up.  So the 624 recover in about 15 minutes, but
> the OST's on that OSS hang waiting for the last two that are dead and
> not coming back.  Note the MDS reports only 624 clients.
>
> Is there a way to tell the OST's to go ahead and evict those two
> clients and finish recovering?  Also "time remaining" has been 0
> since it was booted.  How long will the OST's wait before it lets
> operations continue?
>
> Is there any rule to speeding up recovery?  The OSS that crashed sees
> very little cpus/disk/network traffic when recovery is going on so
> any way to speed it up even if it results in a higher load would be
> great to know.
>
> status: RECOVERING
> recovery_start: 1217509142
> time remaining: 0
> connected_clients: 624/626
> completed_clients: 624/626
> replayed_requests: 0/??
> queued_requests: 0
> next_transno: 175342162
> status: RECOVERING
> recovery_start: 1217509144
> time remaining: 0
> connected_clients: 624/626
> completed_clients: 624/626
> replayed_requests: 0/??
> queued_requests: 0
> next_transno: 193097794

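For anyone wanting to check how long each target has been in recovery,
the status blocks quoted above can be parsed with a few lines of Python.
This is only a rough sketch, not a Lustre tool: the function names are
made up for illustration, the field names are taken from the quoted
output, and the 300-second window is an assumed threshold that should be
adjusted to the timeout actually configured on the site.

```python
import time

# Sample text copied from the status output quoted in this thread.
SAMPLE = """\
status: RECOVERING
recovery_start: 1217509142
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 175342162
status: RECOVERING
recovery_start: 1217509144
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 193097794
"""


def parse_recovery_status(text):
    """Return one dict per target; a new block starts at each 'status:' line."""
    blocks = []
    for raw in text.splitlines():
        # Tolerate email quoting ("> ") in front of each line.
        line = raw.lstrip("> ").strip()
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "status":
            blocks.append({})
        if blocks:
            blocks[-1][key] = value
    return blocks


def summarize(block, now=None, window=300):
    """Report elapsed recovery time and missing clients for one block.

    'window' is an assumed threshold in seconds, not a value read from Lustre.
    """
    now = time.time() if now is None else now
    elapsed = int(now - int(block["recovery_start"]))
    connected, expected = (int(n) for n in block["connected_clients"].split("/"))
    return {
        "status": block["status"],
        "elapsed_s": elapsed,
        "missing_clients": expected - connected,
        "overdue": block["status"] == "RECOVERING" and elapsed > window,
    }


if __name__ == "__main__":
    for block in parse_recovery_status(SAMPLE):
        print(summarize(block))
```

Passing an explicit `now` makes the check reproducible against saved
output; leaving it out compares against the current time.
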
The recovery should time out after about 5 minutes (with default 100s
timeouts).  The recovery goes as fast as clients connect and submit RPCs
for replay.  If all clients connect, recovery finishes as soon as all
clients report completion.

Are you saying the system is still stuck in recovery after more than
5 or 10 minutes?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

_______________________________________________
Lustre-discuss mailing list
http://lists.lustre.org/mailman/listinfo/lustre-discuss
