See inline

> -----Original Message-----
> From: Smith, Stan [mailto:[EMAIL PROTECTED]
> Sent: Thursday, May 01, 2008 2:37 AM
> To: Leonid Keller; Tzachi Dar
> Cc: [email protected]
> Subject: RE: [ofw] RE: MT25208 vendor status code translation?
>
> Leonid Keller wrote:
> > See inline
> >
> >> -----Original Message-----
> >> From: [EMAIL PROTECTED]
> >> [mailto:[EMAIL PROTECTED] On Behalf Of Smith, Stan
> >> Sent: Wednesday, April 30, 2008 3:59 AM
> >> To: Tzachi Dar
> >> Cc: [email protected]
> >> Subject: [ofw] RE: MT25208 vendor status code translation?
> >>
> >>
> >> Forgot to mention the firmware is 4.08.0200 from vstat.
> >>
> >> Smith, Stan wrote:
> >>> Hello,
> >>>   Can you point me to a document which would translate and describe
> >>> a MT25208 vendor status code reported in an IBAL
> >>> ib_wc_t.vendor_specific field? The IBAL error reported is
> >>> RNR_RETRY_ERR, curious as to what the vendor field value (0x87)
> >>> implies.
> >
> > 0x87 is the vendor code for RNR retry exceeded
>
> Thanks for the decode.
> Turns out, as suspected, the receiver was not posting receives fast
> enough, hence the RNR timeout logic kicked in due to a small
> rnr_retry_cnt with a short rnr_nak_timeout. Increased both values -
> problem has gone away for now.
>
> >>>
> >>> The problem we are attempting to understand is that in times of
> >>> 'heavy' MPI induced system/node stress, the IBAL work-completion
> >>> ib_wc_t.wr_id returns in the CQ callback handler set to zero? It was
> >>> set as a valid pointer prior to the send post operation.
> >
> > WQE's wr_id is not sent/received, it is kept in an array, related to
> > the WQE's QP.
> > Incorrect wr_id may be returned only when mthca_poll_one() failed to
> > find the QP, related to the CQE in question.
> > It prints "CQ entry for unknown QP %06x" warning in this case.
> > The failure to find the QP may occur if the QP has been already
> > destroyed.
> > The driver code handles this situation, but maybe there is still a
> > bug, which shows up only under heavy stress.
> >
> > Do you see the above warning when you get wr_id = 0 ?
>
> Found nothing in the system event log.
It will not be there: the warning is debugger output, not an event log entry.

> >>> Without the
> >>> induced system stress (other MPI/DAPL jobs running) the failing test
> >>> runs for days.
> >>>
> >>> Thanks,
> >>>
> >>> Stan.
> >>
> >> _______________________________________________
> >> ofw mailing list
> >> [email protected]
> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
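
To make the wr_id explanation above concrete, here is a rough toy model in C. It is not the mthca driver code. Every name in it (toy_qp, toy_cqe, toy_poll_one and so on) and both table sizes are invented for illustration. It only shows the idea that wr_id is stored per QP at post time and looked up again from the CQE at poll time, so a CQE that arrives after its QP is gone has nothing to look up. Setting wr_id to zero on a failed lookup is an assumption made here to mirror the reported symptom.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_QPS  4   /* hypothetical table sizes, illustration only */
#define MAX_WQES 8

/* Hypothetical stand-in for a QP: wr_id values live here, not on the wire. */
struct toy_qp {
    uint32_t qpn;
    int      in_use;
    uint64_t wrid[MAX_WQES];
};

/* Hypothetical stand-in for a CQE: it names the QP and the WQE slot only. */
struct toy_cqe {
    uint32_t qpn;
    uint32_t wqe_index;
};

static struct toy_qp qp_table[MAX_QPS];

/* Look up a live QP by number; returns NULL if it does not exist. */
static struct toy_qp *toy_find_qp(uint32_t qpn)
{
    for (size_t i = 0; i < MAX_QPS; i++)
        if (qp_table[i].in_use && qp_table[i].qpn == qpn)
            return &qp_table[i];
    return NULL;
}

/* Claim a free table slot for a new QP; returns NULL if the table is full. */
struct toy_qp *toy_create_qp(uint32_t qpn)
{
    for (size_t i = 0; i < MAX_QPS; i++) {
        if (!qp_table[i].in_use) {
            memset(&qp_table[i], 0, sizeof qp_table[i]);
            qp_table[i].qpn = qpn;
            qp_table[i].in_use = 1;
            return &qp_table[i];
        }
    }
    return NULL;
}

/* After this, completions that still name the QP can no longer be resolved. */
void toy_destroy_qp(struct toy_qp *qp)
{
    assert(qp != NULL);
    qp->in_use = 0;
}

/* Remember the caller's wr_id in the QP's array at post time. */
void toy_post_send(struct toy_qp *qp, uint32_t wqe_index, uint64_t wr_id)
{
    assert(qp != NULL);
    qp->wrid[wqe_index % MAX_WQES] = wr_id;
}

/*
 * Resolve one CQE back to the caller's wr_id.
 * Returns 0 on success, -1 if the QP is unknown. In the failure case the
 * toy prints the warning quoted in this thread and sets *wr_id to 0
 * (an assumption made to mirror the symptom, not confirmed driver behaviour).
 */
int toy_poll_one(const struct toy_cqe *cqe, uint64_t *wr_id)
{
    struct toy_qp *qp;

    assert(cqe != NULL && wr_id != NULL);
    qp = toy_find_qp(cqe->qpn);
    if (qp == NULL) {
        fprintf(stderr, "CQ entry for unknown QP %06x\n", (unsigned)cqe->qpn);
        *wr_id = 0;
        return -1;
    }
    *wr_id = qp->wrid[cqe->wqe_index % MAX_WQES];
    return 0;
}
```

In this toy, posting on a QP and polling its CQE gives back the original wr_id. Destroying the QP first and then polling the same CQE takes the "unknown QP" path instead. So if wr_id = 0 shows up on the real system, it may be worth watching the debugger output for that warning and checking whether any QP is torn down while its completions are still outstanding.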
