On Wed, Aug 26, 2009 at 06:52:24PM -0700, Abe Ingersoll wrote:
> ......
> kiblnd_tx_complete()) Tx -> 10.168.22....@o2ib cookie 0xc8dd6 sending 1
> waiting 1: failed 12

12 == IB_WC_RETRY_EXC_ERR, which usually indicates faulty links in the
network, or some other application (such as an MPI application) hogging
network resources at Lustre's expense. We once observed such errors at
times when there was no IO at all: a bad MPI implementation was resending
so aggressively upon RNR that even the tiny bit of keepalive traffic from
Lustre would end up with IB_WC_RETRY_EXC_ERR.

Diagnostics from OFED and the fabric should point you to faulty hardware,
and setting up IB QoS should prevent Lustre from being hurt badly by
someone else.

Meanwhile, there's a potential workaround mentioned here:
https://bugzilla.lustre.org/show_bug.cgi?id=14223#c36

But it's certainly not a good solution in the long run.

Thanks,
Isaac

_______________________________________________
Lustre-discuss mailing list
http://lists.lustre.org/mailman/listinfo/lustre-discuss
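
For anyone decoding similar messages, here is a minimal Python sketch that maps the trailing "failed N" number to a work-completion status name. The table is an excerpt (values 0-13) of enum ib_wc_status as I recall it from the kernel's ib_verbs.h, so check it against the headers of your own OFED/kernel. The regular expression is an assumption modeled on the single log line quoted above, and decode_tx_failure is a hypothetical helper name, not part of Lustre or OFED.

```python
import re

# Excerpt of enum ib_wc_status in declaration order (index == numeric value).
# Assumption: verify against include/rdma/ib_verbs.h on your system.
IB_WC_STATUS = [
    "IB_WC_SUCCESS",            # 0
    "IB_WC_LOC_LEN_ERR",        # 1
    "IB_WC_LOC_QP_OP_ERR",      # 2
    "IB_WC_LOC_EEC_OP_ERR",     # 3
    "IB_WC_LOC_PROT_ERR",       # 4
    "IB_WC_WR_FLUSH_ERR",       # 5
    "IB_WC_MW_BIND_ERR",        # 6
    "IB_WC_BAD_RESP_ERR",       # 7
    "IB_WC_LOC_ACCESS_ERR",     # 8
    "IB_WC_REM_INV_REQ_ERR",    # 9
    "IB_WC_REM_ACCESS_ERR",     # 10
    "IB_WC_REM_OP_ERR",         # 11
    "IB_WC_RETRY_EXC_ERR",      # 12
    "IB_WC_RNR_RETRY_EXC_ERR",  # 13
]

# Assumed message shape: the line ends in "failed <decimal status>".
_FAILED_RE = re.compile(r"failed (\d+)\s*$")


def decode_tx_failure(line):
    """Return the status name for a 'failed N' log line, or None if no match."""
    match = _FAILED_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    if code < len(IB_WC_STATUS):
        return IB_WC_STATUS[code]
    # Values beyond the excerpt are reported rather than guessed.
    return "unknown status %d" % code


line = ("kiblnd_tx_complete()) Tx -> 10.168.22....@o2ib cookie 0xc8dd6 "
        "sending 1 waiting 1: failed 12")
print(decode_tx_failure(line))
```

Note that 12 (transport retries exhausted) and 13 (RNR retries exhausted) are separate statuses, so it is worth checking which of the two your logs actually show.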
