See inline

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Smith, Stan
> Sent: Wednesday, April 30, 2008 3:59 AM
> To: Tzachi Dar
> Cc: [email protected]
> Subject: [ofw] RE: MT25208 vendor status code translation?
>
> Forgot to mention the firmware is 4.08.0200 from vstat.
>
> Smith, Stan wrote:
> > Hello,
> > Can you point me to a document which would translate and describe a
> > MT25208 vendor status code reported in an IBAL ib_wc_t.vendor_specific
> > field?
> > The IBAL error reported is RNR_RETRY_ERR, curious as to what the
> > vendor field value (0x87) implies.

0x87 is the vendor code for RNR retry exceeded.

> > The problem we are attempting to understand is that in times of
> > 'heavy' MPI induced system/node stress, the IBAL work-completion
> > ib_wc_t.wr_id returns in the CQ callback handler set to zero? It was
> > set as a valid pointer prior to the send post operation.

The WQE's wr_id is not sent or received; it is kept in an array associated with the WQE's QP. An incorrect wr_id can be returned only when mthca_poll_one() fails to find the QP related to the CQE in question. In that case it prints a "CQ entry for unknown QP %06x" warning.
The QP lookup may fail if the QP has already been destroyed. The driver code handles this situation, but maybe there is still a bug that shows up only under heavy stress.
Do you see the above warning when you get wr_id = 0?

> > Without the
> > induced system stress (other MPI/DAPL jobs running) the failing test
> > runs for days.
> >
> > Thanks,
> >
> > Stan.

_______________________________________________
ofw mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
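
P.S. To make the wr_id explanation above concrete, here is a rough sketch of the lookup, not the actual mthca driver code. Every name in it (sketch_qp, sketch_cqe, sketch_wc, sketch_poll_one and the other sketch_* helpers, the table sizes) is made up for illustration; only the warning text comes from the driver. It assumes the work completion is zero-initialized by the caller, so a failed QP lookup leaves wr_id at zero.

```c
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

#define SKETCH_MAX_QPS  4   /* hypothetical table size */
#define SKETCH_WQ_DEPTH 8   /* hypothetical work queue depth */

/* Hypothetical QP: the wr_id values live here, not in the CQE. */
struct sketch_qp {
    uint32_t qpn;
    int      in_use;
    uint64_t wrid[SKETCH_WQ_DEPTH];
};

/* Hypothetical CQE: carries only the QP number and the WQE index. */
struct sketch_cqe { uint32_t qpn; uint32_t wqe_index; };

/* Hypothetical work completion handed back to the caller. */
struct sketch_wc { uint64_t wr_id; };

static struct sketch_qp sketch_qp_table[SKETCH_MAX_QPS];

static struct sketch_qp *sketch_find_qp(uint32_t qpn)
{
    for (size_t i = 0; i < SKETCH_MAX_QPS; i++)
        if (sketch_qp_table[i].in_use && sketch_qp_table[i].qpn == qpn)
            return &sketch_qp_table[i];
    return NULL;
}

static struct sketch_qp *sketch_create_qp(uint32_t qpn)
{
    for (size_t i = 0; i < SKETCH_MAX_QPS; i++) {
        if (!sketch_qp_table[i].in_use) {
            struct sketch_qp *qp = &sketch_qp_table[i];
            qp->qpn = qpn;
            qp->in_use = 1;
            for (size_t j = 0; j < SKETCH_WQ_DEPTH; j++)
                qp->wrid[j] = 0;
            return qp;
        }
    }
    return NULL; /* table full */
}

static void sketch_destroy_qp(struct sketch_qp *qp)
{
    qp->in_use = 0; /* later CQEs for this QP number no longer resolve */
}

/* Posting a send stores the caller's wr_id in the QP's own array. */
static int sketch_post_send(struct sketch_qp *qp, uint32_t wqe_index,
                            uint64_t wr_id)
{
    if (wqe_index >= SKETCH_WQ_DEPTH)
        return -1;
    qp->wrid[wqe_index] = wr_id;
    return 0;
}

/* Polling maps the CQE back to its QP, then reads wr_id from the array. */
static int sketch_poll_one(const struct sketch_cqe *cqe, struct sketch_wc *wc)
{
    struct sketch_qp *qp = sketch_find_qp(cqe->qpn);

    if (qp == NULL) {
        fprintf(stderr, "CQ entry for unknown QP %06x\n",
                (unsigned)cqe->qpn);
        return -1; /* wc->wr_id is left as the caller initialized it */
    }
    if (cqe->wqe_index >= SKETCH_WQ_DEPTH)
        return -1;
    wc->wr_id = qp->wrid[cqe->wqe_index];
    return 0;
}
```

In this sketch, a completion polled after sketch_destroy_qp() takes the warning path and never fills in wr_id, which is the scenario the question about the warning is trying to confirm or rule out.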
