On May 3, 2011, at 10:09 AM, DEGREMONT Aurelien wrote:

> Correct me if I'm wrong, but when I look at the Lustre manual, it says
> that the client adapts its timeout, but not the server. I understood
> that server->client RPCs still use the old mechanism, especially in our
> case, where it seems the server is revoking a client lock (is ldlm_timeout
> used for that?) and the client did not respond.

The server and client cooperate on adaptive timeouts. I don't remember
which bug the ORNL settings were in, maybe 14071; Bugzilla isn't responding
at the moment. But the big question here is why the callback took 25315
seconds - that's well beyond anything at_max should allow...

>
> I forgot to say that we also have LNET routers involved in some cases.
>
> Thank you
>
> Aurélien
>
> Andreas Dilger wrote:
>> I don't think ldlm_timeout and obd_timeout have much effect when AT is
>> enabled. I believe that LLNL has some adjusted tunables for AT that might
>> help you (increased at_min, etc).
>>
>> Hopefully Chris or someone at LLNL can comment. I think they were also
>> documented in Bugzilla, though I don't know the bug number.
>>
>> Cheers, Andreas
>>
>> On 2011-05-03, at 6:59 AM, DEGREMONT Aurelien <[email protected]>
>> wrote:
>>
>>
>>> Hello
>>>
>>> We often see some of our Lustre clients being evicted for no apparent
>>> reason (the clients seem healthy).
>>> The pattern is always the same:
>>>
>>> All of this is on Lustre 2.0, with adaptive timeouts enabled.
>>>
>>> 1 - A server complains about a client:
>>> ### lock callback timer expired... after 25315s...
>>> (nothing on client)
>>>
>>> (a few seconds later)
>>>
>>> 2 - The client receives -107 in reply to an obd_ping for this target
>>> (server says "@@@processing error 107")
>>>
>>> 3 - The client realizes its connection was lost.
>>> The client notices it was evicted.
>>> It reconnects.
>>>
>>> (To be sure) When a client is evicted, all in-flight I/O is lost and no
>>> recovery will be done for it?
>>>
>>> We are thinking of increasing the timeout to give clients more time to
>>> answer the ldlm revocation.
>>> (maybe the client is just too loaded)
>>> - Is ldlm_timeout enough to do so?
>>> - Do we also need to change obd_timeout accordingly? Is there a risk
>>> of triggering new timeouts if we change only ldlm_timeout (cascading
>>> timeouts)?
>>>
>>> Any feedback in this area is welcome.
>>>
>>> Thank you
>>>
>>> Aurélien Degrémont
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> [email protected]
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
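
The point in the reply that 25315 seconds is "well beyond anything at_max should allow" can be illustrated with a small sketch. This is not Lustre code: the function names are hypothetical, the bounds of 0 and 600 seconds are assumed defaults for at_min and at_max, and the real adaptive timeout logic involves more than a simple clamp. It only shows the bounding idea, that an adaptive estimate is expected to stay within [at_min, at_max].

```python
# Hypothetical illustration only; not taken from the Lustre source.
AT_MIN = 0    # assumed default for at_min, in seconds
AT_MAX = 600  # assumed default for at_max, in seconds


def clamp_timeout(estimate_s, at_min=AT_MIN, at_max=AT_MAX):
    """Bound an adaptive timeout estimate to the range [at_min, at_max]."""
    return max(at_min, min(estimate_s, at_max))


def exceeds_at_max(observed_s, at_max=AT_MAX):
    """Return True if an observed wait is longer than at_max permits."""
    return observed_s > at_max


if __name__ == "__main__":
    observed = 25315  # seconds, from the "lock callback timer expired" message
    print(clamp_timeout(observed))
    print(exceeds_at_max(observed))
```

Under these assumptions, a 25315 second wait could not come from a bounded adaptive estimate, which is why the reply treats that figure as the main thing to explain. Raising at_min, as suggested for the LLNL tunables, would only lift the lower bound of the same range.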
