On Mon, Nov 09, 2009 at 02:48:34PM +0100, Heiko Schröter wrote:
> Hello,
> 
> we encounter peaks of up to 30% packet loss in our Gigabit network.

It would be helpful if you'd elaborate on where the 30% came from.

> This is sporadic, say once every hour, lasting for some seconds. We cannot 
> say whether it extends into minutes.
> We attribute this to a very high peak load on the network.
> 
> Could Lustre 'reconnect' messages or 'lnet_try_match_md()' errors be 
> correlated with this?

I'm not sure which 'reconnect' messages you mean, but reconnection
attempts are usually rate-limited and backed off exponentially, so I'd
be surprised if they were overwhelming the network.
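
To illustrate why backed-off retries are unlikely to flood a link, here
is a minimal sketch of a generic capped exponential backoff. It is not
Lustre's actual reconnect code; the function name backoff_delays and
its base/cap/attempts parameters are made up for this example.

```shell
#!/bin/sh
# Generic capped exponential backoff (illustration only, NOT Lustre code).
# Usage: backoff_delays BASE CAP ATTEMPTS
# Prints the delay in seconds to wait before each successive retry.
backoff_delays() {
    base=$1
    cap=$2
    attempts=$3
    delay=$base
    i=0
    while [ "$i" -lt "$attempts" ]; do
        printf '%s\n' "$delay"
        # Double the delay after every attempt, but never exceed the cap.
        delay=$((delay * 2))
        if [ "$delay" -gt "$cap" ]; then
            delay=$cap
        fi
        i=$((i + 1))
    done
}
```

With a base of 1 second and a cap of 60 seconds, eight attempts are
spread over several minutes rather than sent back to back, which is why
such traffic stays small compared to a Gigabit link.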

The 'lnet_try_match_md()' errors are usually caused by buffer
management problems in Lustre services, which result in incoming
messages being dropped. If the other end resends those messages
aggressively, that could be a problem, but so far there is too little
information to tell.

> i.e. the MDS has problems matching information between the OSTs and the MGS ...
> What happens inside Lustre when it encounters the famous 'packet loss' on 
> the network? (Are there any timeout/retry counters?)

Packet loss is usually handled by TCP. If you enable network error
console logging, you will see errors once TCP has given up on
retransmission:

    echo +neterror > /proc/sys/lnet/printk
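
Since TCP absorbs most loss before Lustre notices, the kernel's own TCP
counters may be a useful cross-check. Below is a rough sketch, assuming
a Linux node that exposes the standard /proc/net/snmp file; the
function name tcp_retrans_pct is made up here. The ratio of
retransmitted to sent segments is only a proxy for packet loss, not a
direct measurement of it.

```shell
#!/bin/sh
# Print the percentage of sent TCP segments that were retransmissions,
# using the cumulative "Tcp:" counters from /proc/net/snmp.
# Usage: tcp_retrans_pct [FILE]   (FILE defaults to /proc/net/snmp)
tcp_retrans_pct() {
    snmp_file=${1:-/proc/net/snmp}
    awk '
        # The first "Tcp:" line holds the column names; remember positions.
        $1 == "Tcp:" && !have_header {
            for (i = 2; i <= NF; i++) col[$i] = i
            have_header = 1
            next
        }
        # The second "Tcp:" line holds the values for those columns.
        $1 == "Tcp:" {
            out = $(col["OutSegs"])
            re = $(col["RetransSegs"])
            if (out > 0) printf "%.2f\n", 100 * re / out
            else print "0.00"
        }
    ' "$snmp_file"
}
```

The counters are cumulative since boot, so to catch an hourly peak one
would sample the raw OutSegs and RetransSegs values at intervals and
compare the differences, rather than rely on a single call to
tcp_retrans_pct.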

Thanks,
Isaac
_______________________________________________
Lustre-discuss mailing list
http://lists.lustre.org/mailman/listinfo/lustre-discuss