Thanks for the notice, but I had already checked that, and it was OK. I will see if my latest tunings change anything.
Aurélien

On 08/05/2011 05:29, Andreas Dilger wrote:
> Aurelien, now that I think about it, it may be that the LNET errors are
> turned off by default. You should check if the "neterr" debug flag is on.
> Otherwise LNET errors are not printed to the console by default.
>
> Cheers, Andreas
>
> On 2011-05-04, at 8:05 AM, DEGREMONT Aurelien <[email protected]>
> wrote:
>
>> Johann Lombardi wrote:
>>> On Wed, May 04, 2011 at 01:37:14PM +0200, DEGREMONT Aurelien wrote:
>>>
>>>>> I assume that the 25315s is from a bug
>>>>>
>>> BTW, do you see this problem with both extent & inodebits locks?
>>>
>> Yes, both. But more often on the MDS.
>>>> How can I track those dropped RPCs on routers?
>>>>
>>> I don't think routers can drop RPCs w/o a good reason. It is just that a
>>> router failure can lead to packet loss and given that servers don't resend
>>> local callbacks, this can result in client evictions.
>>>
>> Currently I do not see any issue with the routers.
>> Logs are very silent and load is very low. Nothing looks like router failure.
>> If LNET decides to drop packets for some buggy reason, I would expect to have
>> it, at least, say something in kernel log ("omg i've drop 2 packets, please
>> expect evictions :))"
>>
>>>> if client/server do not re-send their RPC.
>>>>
>>> To be clear, clients go through a disconnect/reconnect cycle and eventually
>>> resend RPCs.
>>>
>> I'm not sure I understand clearly what happens there.
>> If the client did not respond to the server AST, it will be evicted by the server.
>> The server does not seem to send a message to tell it (why bother as it seems it
>> is unresponsive or dead anyway?).
>> The client realizes at the next obd_ping that the connection does not exist anymore
>> (rc=-107 ENOTCONN).
>> Then it tries to reconnect, and at that time, the server tells it that it is really
>> evicted. Client says "in progress operation will fail". AFAIK, this means
>> dropping all locks, all dirty pages. Async I/O is lost. Connection status
>> becomes EVICTED. I/O during this window will receive -108, ESHUTDOWN,
>> (kernel log said @@@ IMP_INVALID, see ptlrpc_import_delay_req()).
>> Then the client reconnects, but some I/O was lost; the user program could have
>> experienced errors from I/O syscalls.
>>
>> This is not the same as a connection timeout, where the client will try a
>> failover and do a disconnect/recovery cycle, and everything is OK.
>>
>> Is this correct?
>>
>>> That's bug 3622. Fanyong also used to work on a patch, see
>>> http://review.whamcloud.com/#change,125.
>>>
>> This looks very interesting as it seems to match our issue. But
>> unfortunately, no news for 2 months.
>>
>>
>>
>> Aurélien
>>

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
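
For readers following the eviction sequence described in the quoted message, here is a minimal toy model of it in Python. It is only a sketch of the steps as Aurélien describes them (silent eviction, -107 at the next ping, EVICTED on reconnect with locks and dirty pages dropped, -108 for I/O in that window), not Lustre code. The class, method, and state names (`ToyClient`, `server_evicts`, `ImportState`, and so on) are hypothetical and chosen for illustration; only the two return codes come from the thread.

```python
from enum import Enum

# Return codes as quoted in the thread (Linux errno values).
ENOTCONN = 107
ESHUTDOWN = 108


class ImportState(Enum):
    # Hypothetical state names for this toy model only.
    CONNECTED = "CONNECTED"
    DISCONNECTED = "DISCONNECTED"
    EVICTED = "EVICTED"


class ToyClient:
    """Toy model of the client side of the eviction sequence."""

    def __init__(self):
        self.state = ImportState.CONNECTED
        self.locks = {"lock-a", "lock-b"}
        self.dirty_pages = 3
        self.server_evicted = False  # server-side view of this client

    def server_evicts(self):
        # The server evicts without notifying the client.
        self.server_evicted = True

    def ping(self):
        # The client only notices at the next ping: rc = -107 (ENOTCONN).
        if self.server_evicted and self.state is ImportState.CONNECTED:
            self.state = ImportState.DISCONNECTED
            return -ENOTCONN
        return 0

    def reconnect(self):
        # On reconnect the server reports the eviction; the client drops
        # its locks and dirty pages and marks the connection EVICTED.
        if self.server_evicted:
            self.state = ImportState.EVICTED
            self.locks.clear()
            self.dirty_pages = 0
        else:
            self.state = ImportState.CONNECTED

    def finish_recovery(self):
        # The client is connected again, but the dropped data is gone.
        self.server_evicted = False
        self.state = ImportState.CONNECTED

    def io(self):
        if self.state is ImportState.EVICTED:
            return -ESHUTDOWN  # rc = -108 during the eviction window
        if self.state is ImportState.CONNECTED:
            return 0
        raise RuntimeError("I/O while disconnected is not modelled here")


if __name__ == "__main__":
    client = ToyClient()
    client.server_evicts()
    print("ping rc:", client.ping())
    client.reconnect()
    print("state:", client.state.value, "io rc:", client.io())
    client.finish_recovery()
    print("state:", client.state.value, "io rc:", client.io())
```

The point of the model is the asymmetry raised in the message: after `finish_recovery()` the client works again, but the locks and dirty pages dropped in `reconnect()` are not restored, which is why user programs can see I/O errors.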
