OK. At an ETA of 8100 sec we lost patience and ran
  > lctl --device MDS-Name abort_recovery
This obviously did the trick:
  >> recovery period over; 1 clients never reconnected after 14483s (414 clients did)

Access to the system seems to work as expected.

Still, we are not satisfied at all. One thing we would like to know, urgently, is how to find out which client caused that delay. As indicated before, we have no problem nuking a silly client, tearing it apart, ripping out its memory banks or whatever violent action might be needed. Most probably, though, the fault lies within our configuration, not this single client (perhaps this is a machine that had a Lustre mount some time ago and is now switched off; batch nodes tend to die every now and then).

Our /proc/sys/lustre/timeout is 1000. There has been some debate on this large value here, but most other installations will not run in a network environment with a setup as crazy as ours. Setting the timeout to 100 immediately results in "Transport endpoint" errors; it is impossible to run Lustre like this.

Since this is a 1.6.5.1 system, I activated the adaptive timeouts and set them to equally large values:
  /sys/module/ptlrpc/parameters/at_max = 6000
  /sys/module/ptlrpc/parameters/at_history = 6000
  /sys/module/ptlrpc/parameters/at_early_margin = 50
  /sys/module/ptlrpc/parameters/at_extra = 30

Reading the manual, I understood that at_max is a maximum value. I learned from an earlier question I posted on this list that with the static timeout from /proc/sys/lustre/timeout, recovery will be 2.5 times this value. Assuming the worst, 2.5 times at_max (15000 sec), I still don't arrive at 21000 sec! So I'm quite clueless as to what mistakes I have made here.

By the way, when trying to find out about connected/disconnected clients, I ran "lctl conn_list", which gave me a very long listing (how do you do "| less" in this lctl shell?), with all entries marked as "nonagle". What does that mean?

Oh, one last remark for the record: to run this "lctl abort_recovery" command, you have to find out the right device number or name.
"lctl dl" gives me five entries on my MGS/MDT server, "mgs", "mgc" "mdt" "lov" "mds". The correct device name for the lctl command is the one after "mds". Regards, Thomas Thomas Roth wrote: > Hi all, > > we have a problem with our production system (v. 1.6.5.1). It is in > recovery, but recovery never finishes. > The background are some unknown problems with the MDT, attempts to > restart the MDS etc. The MDT would start recovery, at some point during > recovery lose connection to its OSTs, restart recovery and so on. > > I then moved the service to a partner machine, where recovery started with >>> 11:37:07: ... in recovery for at least 5:00, or until 415 clients > reconnect. > > (I always understood these numbers as minutes, the > /proc/.../recovery_status usually starts at 3000 sec, though 5 min would > be a little less...) > > The countdown went on until >>> 12:03:32: ...227 clients in recovery for 1457s > > Four minutes later, there were >>> 12:07:21: ...133 recoverable clients remain > > Then something bad must have happened, because >>> 12:07:42: ...121 clients in recovery for 20721s > > Most of these clients seemed to be no problem, because only 4 minutes later >>> 12:11:52: ...1 clients in recovery for 20471s > > So far, the countdown continues, but of course these are extremely long > recovery times. > > My questions: > Where might I have misconfigured the system to wait that much for a client? > Is there a command to abort the recovery? > > All the OSTs seem to be connected and happy. I therefore guess that the > remaining client is just one client in the ususal sense - a batch node > or similar machine that still has the system mounted. Of course I would > not hesitate to kick out that client - or many of these if necessary - > but I don't know which it is. So another question: How to find out > about the identities of clients, recoverable/in recovery/without > problems/gone for good ? 
>
>
> Many thanks,
> Thomas
>
>
> _______________________________________________
> Lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

-- 
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 1.262
Phone: +49-6159-71 1453  Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
D-64291 Darmstadt
www.gsi.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528
Geschäftsführer: Professor Dr. Horst Stöcker
Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph,
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt
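
As a footnote to the timeout question in the message above, the arithmetic can be checked with a short Python sketch. It only uses values quoted in the thread (static timeout 1000, at_max 6000, the factor 2.5 from the earlier list reply, and the 20721s shown in the log). It assumes, as the message itself does, that the factor 2.5 would apply to at_max in the same way as to the static timeout; nothing in the thread confirms that. The names recovery_window and RECOVERY_FACTOR are made up for illustration and are not Lustre identifiers.

```python
# Values quoted in the message above.
STATIC_TIMEOUT_S = 1000   # /proc/sys/lustre/timeout
AT_MAX_S = 6000           # /sys/module/ptlrpc/parameters/at_max
RECOVERY_FACTOR = 2.5     # "recovery will be 2.5 times this value"
OBSERVED_S = 20721        # "...121 clients in recovery for 20721s"


def recovery_window(timeout_s, factor=RECOVERY_FACTOR):
    """Expected recovery window in seconds for a given timeout.

    Assumption: the window is simply factor * timeout, as stated in the
    message for the static timeout. Whether this also holds for at_max
    is a guess, not something the thread confirms.
    """
    return factor * timeout_s


static_window = recovery_window(STATIC_TIMEOUT_S)  # static timeout case
worst_case = recovery_window(AT_MAX_S)             # assumed worst case with at_max
unexplained = OBSERVED_S - worst_case              # what the guess cannot account for

print(f"static timeout window: {static_window:.0f} s")
print(f"worst case via at_max: {worst_case:.0f} s")
print(f"observed in the log:   {OBSERVED_S} s")
print(f"unexplained remainder: {unexplained:.0f} s")
```

Even under that worst-case assumption, more than 5000 seconds of the observed recovery time remain unaccounted for, which is the gap the message asks about.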
