While testing our new hardware, I did the following:

I rebooted the active MDS server, and it failed over to the second one
as expected. While the failover was in progress, a client was reset.

When heartbeat brought the MDS up on the second server, it went into
recovery as expected. The MDS has now been in recovery for 1.5 hours,
which I don't think is normal.

What would cause this? I know that a client going down (the reset
above) while the MDS is down, but before recovery starts, will cause
recovery to time out, but 1.5 hours is an unacceptable time to wait for
the file system to come back.

This is a stock Lustre 1.6.5.1 install.

cat recovery_status

status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/1
completed_clients: 0/1
replayed_requests: 0/??
queued_requests: 0
next_transno: 117
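
For anyone who wants to watch for this state from a script, here is a
minimal Python sketch that parses the output above and flags it. The
function names (parse_recovery_status, looks_stalled) are hypothetical,
and the "stalled" test is only my assumption about what looks wrong
here (still RECOVERING, no time remaining, not all clients connected);
it is not Lustre's own logic.

```python
SAMPLE = """\
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/1
completed_clients: 0/1
replayed_requests: 0/??
queued_requests: 0
next_transno: 117
"""


def parse_recovery_status(text):
    """Turn 'key: value' lines into a dict of strings; skip other lines."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def looks_stalled(fields):
    """Heuristic (an assumption): recovery is still in progress, the timer
    shows nothing remaining, and fewer clients connected than expected."""
    connected, _, expected = fields.get("connected_clients", "0/0").partition("/")
    return (
        fields.get("status") == "RECOVERING"
        and fields.get("time_remaining") == "0"
        and int(connected) < int(expected)
    )


fields = parse_recovery_status(SAMPLE)
print(fields["status"], looks_stalled(fields))
```

In practice the text would be read from the recovery_status file
instead of the SAMPLE string.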

Did I somehow wedge the file system?



Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985



_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
