While testing our new hardware, I rebooted the active MDS server, and it failed over to the second one as expected. While this was happening, a client was reset.
When heartbeat brought the MDS up on the new server, it went into recovery as expected. The MDS has now been in recovery for 1.5 hours. I don't think this is normal. What would cause this? I know that having a client go down (the reset above) while the MDS is down but before recovery will cause recovery to time out, but 1.5 hours is an unacceptable time to wait for the file system to come back. This is a stock 1.6.5.1 install.

cat recovery_status
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/1
completed_clients: 0/1
replayed_requests: 0/??
queued_requests: 0
next_transno: 117

Did I somehow wedge the file system?

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
