Hi all,

just want to share my recent insight and increase the number of Google hits for those who suffer from
 - MDT / filesystem becoming suddenly unusable
 - LustreError: ... lock callback timer expired ...
 - LustreError: ... lock on destroyed export ...
 - Lustre: ... Stealing 1 locks ...
 - Lustre: ... All locks stolen ...
 - LustreError: ... busy with active 2 RPCs ...
 - LustreError: ... operation 400 on unconnected MDS ...

All of these and more we have seen on the MDT of our 1.6.7.2 cluster, after it had run for a year without major problems. For the last two weeks, though, the system has not had an uptime of more than 30 hours.

We found a user job submission script that probably caused all this by starting
 - several hundred (900) jobs simultaneously
 - all of them opening one and the same file for batch system errors and one and the same file for their output.

So if someone is sitting in front of an uncooperative MDT, dazed and confused as I was, perhaps this is the direction to investigate.

Still, I'd like to learn more about "operation X on unconnected MDS"; on the net I only found my own question from two years ago.

Regards,
Thomas

--
Thomas Roth
Department: Informationstechnologie
GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
64291 Darmstadt

_______________________________________________
Lustre-discuss mailing list
http://lists.lustre.org/mailman/listinfo/lustre-discuss
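
As a follow-up to the job submission pattern described above (900 jobs all opening the same error file and the same output file), here is a minimal sketch of how a submission script could give every job its own pair of log files instead. It is only an illustration: the helper name `per_job_log_paths`, the directory, and the file naming scheme are hypothetical and not taken from the script we found, and whether this alone would have kept the MDT healthy is an assumption, not something we have tested.

```python
from pathlib import Path


def per_job_log_paths(base_dir, job_name, job_index):
    """Return a distinct (stdout, stderr) path pair for one job.

    Hypothetical helper: the naming scheme is illustrative only.
    """
    # Embed the zero-padded job index in the file name so that no two
    # jobs of the same submission share an output or error file.
    stem = f"{job_name}.{job_index:05d}"
    base = Path(base_dir)
    return base / f"{stem}.out", base / f"{stem}.err"


if __name__ == "__main__":
    # Build the log paths for 900 jobs, matching the job count in the post.
    # The directory below is a made-up example path.
    for i in range(900):
        out_path, err_path = per_job_log_paths("/lustre/scratch/logs", "myjob", i)
        if i < 3:
            print(out_path, err_path)
```

The resulting paths would then be passed to whatever options the batch system in use offers for redirecting a job's standard output and standard error.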
