One of our OSSes died with a panic last night. Between the time it was found (we have no failover) and the time it was restarted, two clients also died (nodes crashed by user OOM).
Because of this, the OSTs are now looking for 626 clients to recover when only 624 are up. The 624 recover in about 15 minutes, but the OSTs on that OSS hang waiting for the last two, which are dead and not coming back. Note that the MDS reports only 624 clients.

Is there a way to tell the OSTs to go ahead and evict those two clients and finish recovering? Also, "time remaining" has been 0 since the OSS was booted. How long will the OSTs wait before they let operations continue?

Is there any rule of thumb for speeding up recovery? The OSS that crashed sees very little CPU/disk/network traffic while recovery is going on, so any way to speed it up, even if it results in a higher load, would be great to know.

Recovery status from two of the OSTs on that OSS:

status: RECOVERING
recovery_start: 1217509142
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 175342162

status: RECOVERING
recovery_start: 1217509144
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 193097794

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985
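
P.S. For anyone who wants to watch the same counters, here is a minimal Python sketch that parses status text like the above and reports how many clients each OST is still waiting on. The function names (`parse_recovery_status`, `missing_clients`) are hypothetical helpers of my own, not part of Lustre, and the sketch assumes one `key: value` field per line with each record starting at `status:`. It only reads text you hand it; where that text comes from on your system is up to you.

```python
def parse_recovery_status(text):
    """Split concatenated status dumps into one dict per target.

    Assumes one "key: value" field per line and that every record
    begins with a "status:" line (an assumption based on the output above).
    """
    records, current = [], None
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or unrecognised lines
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "status":
            current = {}
            records.append(current)
        if current is not None:
            current[key] = value
    return records


def missing_clients(record):
    """Return expected minus connected clients from a 'connected/expected' field."""
    connected, expected = (int(n) for n in record["connected_clients"].split("/"))
    return expected - connected


SAMPLE = """\
status: RECOVERING
recovery_start: 1217509142
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 175342162
status: RECOVERING
recovery_start: 1217509144
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 193097794
"""

for rec in parse_recovery_status(SAMPLE):
    print(rec["status"], "waiting on", missing_clients(rec), "client(s)")
```

With the two records above, each OST shows as still waiting on 2 clients, which matches the two nodes that died.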
