Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the 
following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11544



Bernd,

I strongly suspect that this bug shows up only when network errors force
Lustre to recover its connections.  In the logs you last posted, an RPC
request sent from the client to [EMAIL PROTECTED] (beo-102) completed with
failure (status 12, IB_WC_RETRY_EXC_ERR: message retries exceeded).

BTW, these messages do not appear on the console by default, since attempts to
communicate with a node that's in reset can also cause them.  However, if you
want low-level network errors to appear on the console and in syslog, you can run
"echo + neterror > /proc/sys/lnet/printk"

If network failures always occur on one client, then I'd want to see if the
problem moves with the client's HCA.  Similarly, if they always occur with the
same server, I'd want to see if it moves with the server's HCA.  Otherwise, I'd
suspect that the fabric has a problem; in fact, that's my strongest suspicion.

At the time the error noted above was logged, Lustre had posted many 1 MByte
RDMAs.  In fact, Lustre can load the fabric with incessant, many-to-many RDMAs,
which stress the fabric harder than many test programs or applications do.
Periodically checking and zeroing the fabric error counters to see where most
errors are accumulating could help isolate the problem.
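
As a rough sketch of that kind of periodic check, the shell functions below
snapshot a node's local HCA port counters and report which ones grew between
two snapshots.  It compares snapshots rather than zeroing anything.  The
function names are hypothetical, and the layout (one number per file under
/sys/class/infiniband/<hca>/ports/<port>/counters/) is an assumption about
how the local IB stack exposes counters.  It covers only the node's own HCA
ports; switch ports would need the fabric's own diagnostic tools.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, and the sysfs layout
# <root>/<hca>/ports/<port>/counters/<name> is an assumption about the
# local InfiniBand stack.

# Print "<hca>/<port>/<counter> <value>" for each readable counter file.
# Traffic counters (*_data, *_packets) are skipped so only error-style
# counters remain; that name filter is also an assumption.
snapshot_ib_counters() {
    root=${1:-/sys/class/infiniband}
    for f in "$root"/*/ports/*/counters/*; do
        [ -r "$f" ] || continue
        name=${f##*/}
        case $name in
            *_data|*_packets) continue ;;
        esac
        rel=${f#"$root"/}              # <hca>/ports/<port>/counters/<name>
        hca=${rel%%/*}
        port=${rel#*/ports/}
        port=${port%%/*}
        printf '%s/%s/%s %s\n' "$hca" "$port" "$name" "$(cat "$f")"
    done
}

# Print "<increase> <hca>/<port>/<counter>" for every counter that grew
# between snapshot file $1 and snapshot file $2, largest increase first.
diff_ib_snapshots() {
    awk 'FILENAME == ARGV[1] { before[$1] = $2 + 0; next }
         ($1 in before) && ($2 + 0 > before[$1]) {
             print $2 - before[$1], $1
         }' "$1" "$2" | sort -rn
}
```

One way to use it: run "snapshot_ib_counters > before" on each node, let the
Lustre load run for a while, run "snapshot_ib_counters > after", and then
compare with "diff_ib_snapshots before after".  Nodes whose counters keep
growing are the first places to look.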

Of course since you have our best reproducer for this bug, we'd rather you
didn't fix the fabric until we've solved it :)

_______________________________________________
Lustre-devel mailing list
Lustre-devel@clusterfs.com
https://mail.clusterfs.com/mailman/listinfo/lustre-devel