Hello,

We often see some of our Lustre clients being evicted wrongly (the clients seem healthy). The pattern is always the same:

1 - A server complains about a client: "### lock callback timer expired... after 25315s..." (nothing is logged on the client).
    (a few seconds later)
2 - The client receives -107 in reply to an obd_ping for this target (the server says "@@@processing error 107").
3 - The client realizes its connection was lost, notices it was evicted, and reconnects.

All of this is on Lustre 2.0, with adaptive timeouts enabled.

Just to be sure: when a client is evicted, all in-flight I/O is lost and no recovery is attempted for it, correct?

We are thinking of increasing the timeout to give clients more time to answer the ldlm lock revocation (maybe they are just too loaded).

- Is changing ldlm_timeout enough to do so?
- Do we also need to change obd_timeout accordingly? Is there a risk of triggering new, cascading timeouts if we change only ldlm_timeout?

Any feedback in this area is welcome.

Thank you,

Aurélien Degrémont
_______________________________________________
Lustre-discuss mailing list
http://lists.lustre.org/mailman/listinfo/lustre-discuss
