I don't think ldlm_timeout and obd_timeout have much effect when AT is enabled. I believe that LLNL has some adjusted tunables for AT that might help you (increased at_min, etc).
Hopefully Chris or someone at LLNL can comment. I think they were also documented in bugzilla, though I don't know the bug number.

Cheers, Andreas

On 2011-05-03, at 6:59 AM, DEGREMONT Aurelien <[email protected]> wrote:

> Hello
>
> We often see some of our Lustre clients being evicted without good reason
> (the clients seem healthy).
> The pattern is always the same:
>
> All of this on Lustre 2.0, with adaptive timeouts enabled
>
> 1 - A server complains about a client:
> ### lock callback timer expired... after 25315s...
> (nothing on client)
>
> (few seconds later)
>
> 2 - The client receives -107 to an obd_ping for this target
> (server says "@@@processing error 107")
>
> 3 - The client realizes its connection was lost.
> The client notices it was evicted.
> It reconnects.
>
> (To be sure) When a client is evicted, all in-flight I/O is lost, and no
> recovery will be done for it?
>
> We are thinking of increasing the timeout to give clients more time to
> answer the ldlm revocation.
> (maybe it is just too loaded)
> - Is ldlm_timeout enough to do so?
> - Do we also need to change obd_timeout accordingly? Is there a risk
> of triggering new timeouts if we just change ldlm_timeout (cascading timeouts)?
>
> Any feedback in this area is welcome.
>
> Thank you
>
> Aurélien Degrémont
> _______________________________________________
> Lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
