Hi all,

we have a problem with our production system (v. 1.6.5.1): it is in recovery, but recovery never finishes. The background is a series of unexplained problems with the MDT, attempts to restart the MDS, etc. The MDT would start recovery, at some point during recovery lose the connection to its OSTs, restart recovery, and so on.
I then moved the service to a partner machine, where recovery started with

>> 11:37:07: ... in recovery for at least 5:00, or until 415 clients reconnect.

(I have always read this figure as minutes, although /proc/.../recovery_status usually starts at 3000 sec, and 5 min would be a little less...) The countdown went on until

>> 12:03:32: ...227 clients in recovery for 1457s

Four minutes later, there were

>> 12:07:21: ...133 recoverable clients remain

Then something bad must have happened, because

>> 12:07:42: ...121 clients in recovery for 20721s

Most of these clients seemed to be no problem, because only 4 minutes later

>> 12:11:52: ...1 clients in recovery for 20471s

So far, the countdown continues, but of course these are extremely long recovery times.

My questions: Where might I have misconfigured the system so that it waits that long for a client? Is there a command to abort the recovery?

All the OSTs seem to be connected and happy. I therefore guess that the remaining client is just one client in the usual sense: a batch node or similar machine that still has the file system mounted. Of course I would not hesitate to kick out that client, or many of them if necessary, but I don't know which one it is.

So another question: How can I find out the identities of the clients, i.e. which are recoverable, in recovery, without problems, or gone for good?

Many thanks,
Thomas
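
P.S. To put the countdown values quoted above into perspective, here is a minimal Python sketch that converts them to hours, minutes and seconds. The regular expression and the helper names (format_hms, parse_line, summarize) are only my own illustration and are not part of Lustre; the pattern is an assumption that matches just the "N clients in recovery for Ns" form of the log lines shown in this mail.

```python
import re

# Log excerpts copied from the messages quoted in this mail.
LOG_LINES = [
    "12:03:32: ...227 clients in recovery for 1457s",
    "12:07:42: ...121 clients in recovery for 20721s",
    "12:11:52: ...1 clients in recovery for 20471s",
]

# Assumed pattern: only covers the "N clients in recovery for Ns" form seen above.
PATTERN = re.compile(
    r"^(\d{2}:\d{2}:\d{2}): \.\.\.(\d+) clients in recovery for (\d+)s$"
)


def format_hms(seconds):
    """Convert a number of seconds to an h:mm:ss string."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def parse_line(line):
    """Return (timestamp, client count, countdown in seconds), or None if no match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3))


def summarize(lines):
    """Return (timestamp, client count, countdown as h:mm:ss) for each matching line."""
    rows = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            timestamp, clients, countdown = parsed
            rows.append((timestamp, clients, format_hms(countdown)))
    return rows


if __name__ == "__main__":
    for timestamp, clients, countdown in summarize(LOG_LINES):
        print(f"{timestamp}  {clients:>3} clients  countdown {countdown}")
```

By this reckoning, the jump from 1457s to 20721s is a jump from about 24 minutes to about 5 hours 45 minutes, and the last value, 20471s, is still 5:41:11.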
