On Mon, Nov 09, 2009 at 02:48:34PM +0100, Heiko Schröter wrote:
> Hello,
>
> we do encounter peaks of upto 30% package loss in our Gigabit Network.

It would be helpful if you'd elaborate on where the 30% figure came from.

> This is sporadic, say once every hour remaining for some seconds. We cannot
> specify if it extends into minutes.
> We do relate this to a very high peak load on the net.
>
> Could it be that lustre 'reconnect' messages or 'lnet_try_match_md()' are
> correlated to this ?

I'm not sure which 'reconnect' you meant, but usually they're rate limited
and backed off exponentially, so I'd be surprised if reconnection requests
were overwhelming the network.

The 'lnet_try_match_md()' errors are usually caused by buffer management
problems in Lustre services, which result in incoming messages being
dropped. If the other end resends those messages aggressively, that could
be a problem, but right now there's too little information to tell.

> i.e. the mds has problems to match infos between osts and mgs ...
> What happens inside lustre when it stumbles across famous 'package loss' on
> the net ? (Any timeout/retry counters ???)

Usually packet loss is handled by TCP. If you enable network error console
logging, you'll see errors when TCP has given up retransmission:

    echo +neterror > /proc/sys/lnet/printk

Thanks,
Isaac
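
P.S. Since packet loss on these links is absorbed by TCP, one rough way to put a number on it from a given node is to look at the kernel's TCP retransmission counters. The following is only a sketch, assuming a Linux node that exposes /proc/net/snmp; the function name tcp_retrans_stats is made up for illustration and is not part of Lustre or LNet.

```shell
#!/bin/sh
# Hypothetical helper (not a Lustre tool): print TCP segments sent,
# segments retransmitted, and the retransmit percentage.
# Reads the Linux counters in /proc/net/snmp, or a file given as $1.
tcp_retrans_stats() {
    awk '
        $1 == "Tcp:" && !have_header {
            # The first "Tcp:" line holds the column names.
            for (i = 2; i <= NF; i++) col[$i] = i
            have_header = 1
            next
        }
        $1 == "Tcp:" {
            # The second "Tcp:" line holds the matching values.
            out = $(col["OutSegs"])
            re  = $(col["RetransSegs"])
            pct = (out > 0) ? 100 * re / out : 0
            printf "OutSegs=%s RetransSegs=%s retrans_pct=%.2f\n", out, re, pct
        }
    ' "${1:-/proc/net/snmp}"
}

# Usage:
#   tcp_retrans_stats              # reads /proc/net/snmp
#   tcp_retrans_stats saved.snmp   # reads a saved copy
```

The counters are cumulative since boot, so to catch a sporadic peak you would sample twice around the event and compare the differences rather than the absolute values. A retransmit percentage is also only a proxy for loss as seen by this host's TCP stack, not a measurement of the switch or the wire.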
