Thanks for the notice, but I had already checked that, and it was OK.
I will see whether my latest tunings change anything.

Aurélien

On 08/05/2011 05:29, Andreas Dilger wrote:
> Aurelien, now that I think about it, it may be that the LNET errors are 
> turned off by default. You should check if the "neterr" debug flag is on. 
> Otherwise LNET errors are not printed to the console by default.
>
> Cheers, Andreas
>
> On 2011-05-04, at 8:05 AM, DEGREMONT Aurelien <[email protected]> wrote:
>
>> Johann Lombardi wrote:
>>> On Wed, May 04, 2011 at 01:37:14PM +0200, DEGREMONT Aurelien wrote:
>>>
>>>>> I assume that the 25315s is from a bug
>>>>>
>>> BTW, do you see this problem with both extent & inodebits locks?
>>>
>> Yes, both, but more often on the MDS.
>>>> How can I track those dropped RPCs on routers?
>>>>
>>> I don't think routers can drop RPCs w/o a good reason. It is just that a 
>>> router failure can lead to packet loss and given that servers don't resend 
>>> local callbacks, this can result in client evictions.
>>>
>> Currently I do not see any issue with the routers.
>> Logs are very silent and load is very low. Nothing looks like router failure.
>> If LNET decides to drop a packet for some buggy reason, I would expect it 
>> to at least say something in the kernel log ("omg I've dropped 2 packets, 
>> please expect evictions" :))
>>
>>>> if client/server do not re-send their RPC.
>>>>
>>> To be clear, clients go through a disconnect/reconnect cycle and eventually 
>>> resend RPCs.
>>>
>> I'm not sure I clearly understand what happens there.
>> If a client does not respond to a server AST, it will be evicted by the 
>> server. The server does not seem to send a message to tell it (why bother, 
>> as it seems to be unresponsive or dead anyway?).
>> The client realizes at the next obd_ping that the connection does not 
>> exist anymore (rc=-107 ENOTCONN).
>> Then it tries to reconnect, and at that time the server tells it that it 
>> has really been evicted. The client says "in progress operation will fail". 
>> AFAIK, this means dropping all locks and all dirty pages. Async I/O is 
>> lost. The connection status becomes EVICTED. I/O during this window will 
>> receive -108, ESHUTDOWN (the kernel log said @@@ IMP_INVALID, see 
>> ptlrpc_import_delay_req()).
>> Then the client reconnects, but some I/O was lost, and the user program 
>> could have experienced errors from I/O syscalls.
>>
>> This is not the same as a connection timeout, where the client will try a 
>> failover and do a disconnect/recovery cycle, and everything is OK.
>>
>> Is this correct?
>>
>>> That's bug 3622. Fanyong also used to work on a patch, see 
>>> http://review.whamcloud.com/#change,125.
>>>
>> This looks very interesting as it seems to match our issue. But 
>> unfortunately, there has been no news for 2 months.
>>
>>
>>
>> Aurélien
>>
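
The eviction sequence described in the quoted message of 2011-05-04 can be
sketched as a small state machine. This is only a toy model of that
description, not Lustre code: the class, method, and state names are
hypothetical, the return codes 107 and 108 are the values quoted in the
message, and letting writes succeed in every state except EVICTED is an
assumption made to keep the example short.

```python
from enum import Enum

# Return codes as quoted in the message (Linux errno values).
ENOTCONN = 107   # seen by the client at its next ping after a silent eviction
ESHUTDOWN = 108  # returned to I/O issued while the connection is EVICTED


class State(Enum):
    CONNECTED = "connected"
    DISCONNECTED = "disconnected"
    EVICTED = "evicted"


class ToyClient:
    """Hypothetical model of the client side of an eviction."""

    def __init__(self):
        self.state = State.CONNECTED
        self.server_evicted = False  # server-side view, unknown to the client
        self.locks = 0
        self.dirty_pages = 0

    def server_evicts(self):
        # The server evicts silently; the client state is untouched.
        self.server_evicted = True

    def ping(self):
        # The client only discovers the lost connection at its next ping.
        if self.server_evicted and self.state is State.CONNECTED:
            self.state = State.DISCONNECTED
            return -ENOTCONN
        return 0

    def reconnect(self):
        # On reconnect the server reports the eviction: the client drops
        # all locks and dirty pages, so buffered async I/O is lost.
        if self.state is State.DISCONNECTED:
            self.state = State.EVICTED
            self.locks = 0
            self.dirty_pages = 0

    def write(self, pages):
        # Assumption: only the EVICTED window fails I/O in this model.
        if self.state is State.EVICTED:
            return -ESHUTDOWN
        self.dirty_pages += pages
        return pages

    def finish_recovery(self):
        # The client ends up connected again, without the lost I/O.
        if self.state is State.EVICTED:
            self.server_evicted = False
            self.state = State.CONNECTED
```

Walking one client through server_evicts(), ping(), reconnect(), and
finish_recovery() reproduces the order of events in the message: pages
written before the ping are dropped at reconnect, and writes issued while
the state is EVICTED fail with -108.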

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
