Hi all,

after running for days without any problems, our MDS has been refusing to cooperate for two hours now. The log files show nothing until

>Mar 5 16:46:24 mds1 kernel: Lustre: 17841:0:(ldlm_lib.c:525:target_handle_reconnect()) MDT0000: 481fa70b-590d-31b6-f621-c6125a54bfff reconnecting
>Mar 5 16:46:24 mds1 kernel: Lustre: 17841:0:(ldlm_lib.c:760:target_handle_connect()) MDT0000: refuse reconnection from [email protected]@tcp to 0xffff8107ef44a000; still busy with 2 active RPCs

I thought that such a thing would be between the MDT and this particular client. However, the log goes on like that with many other clients. Now the MDS is refusing any connection, bringing the system to a standstill.

The situation also triggered about 130 log dumps to /tmp. Most of these are small and contain just

>Watchdog triggered for pid 17866: it was inactive for 12000s
>Unable to dump stack because of missing export

A few are larger and contain more complaints about lengthy requests and possible timeouts:

>ptlrpc_server_handle_request Request x75091039 took longer than estimated (42+4208s); client may timeout.

or

>ptlrpc_server_handle_request Dropping timed-out request from 12345-140.181.114....@tcp: deadline 1000+923s ago

None of these seem critical? Maybe all clients have timed out for some reason? Even so, I'd expect the MDS to remain responsive, say to a mount request from a fresh client that cannot have any leftover transactions pending on it.

Right now the only thing I can see to do is to reboot the server. Of course that is not a nice procedure on a system we advertised as stable and reliable to our users...

So any help will be much appreciated.

Regards,
Thomas
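
P.S. In case it helps anyone looking at a similar situation: a rough way to see how many distinct clients are being refused, and how many active RPCs the MDS reports for each, is to tally the "refuse reconnection" lines from syslog. The following is only a minimal sketch. The function name summarize_refusals is made up for illustration, and the regular expression assumes each message sits on a single unwrapped line worded like the excerpt above.

```python
import re
from collections import Counter

# Assumed message shape, taken from the excerpt quoted in this post:
#   "... refuse reconnection from <client> to <export>; still busy with N active RPCs"
REFUSE_RE = re.compile(
    r"refuse reconnection from (?P<client>\S+) to \S+; "
    r"still busy with (?P<rpcs>\d+) active RPCs"
)


def summarize_refusals(lines):
    """Count refused reconnections per client and track the largest
    'active RPCs' figure reported for each client.

    `lines` is any iterable of syslog lines (hypothetical helper,
    not part of Lustre).
    """
    refusals = Counter()
    max_rpcs = {}
    for line in lines:
        match = REFUSE_RE.search(line)
        if match is None:
            continue  # not a refused-reconnect message
        client = match.group("client")
        rpcs = int(match.group("rpcs"))
        refusals[client] += 1
        max_rpcs[client] = max(rpcs, max_rpcs.get(client, 0))
    return refusals, max_rpcs
```

It can be fed an open log file, for example `summarize_refusals(open("/var/log/messages", errors="replace"))`, where the path is an assumption about where the kernel messages end up on the MDS. `refusals.most_common()` then lists the clients that were refused most often.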
