[EMAIL PROTECTED] wrote on Fri, 22 Feb 2008 14:11 -0600:
> We just had this occur..
>
> Is this really a valid assert? Are there any other valid states that can 
> cause a transition to RQ_RTS_WAITING_USER_TEST besides 
> RQ_RTS_WAITING_RTS_DONE?
>
> [D 02/22 13:44] PVFS2 Server version 2.7.1pre1-2008-02-19-171553 starting.
> [E 02/22 13:44] max send/recv sge 14 15
> [E 02/22 13:52] Error: encourage_recv_incoming: mop_id 10164ae0 in RTS_DONE 
> message not found.

The other side did an RDMA, then sent a message saying "the RDMA is
done" that references the given mop_id.  This side is complaining
that it doesn't know about any such mop_id.  There's really not much
else to do here but die.  How did this side forget about the mop_id?
Did the other side send a duplicate done message?  Either of these
would be a bug.

Perhaps a cancelled message on the receiver might lead to some sort
of breakage here.  If so, you would probably have log entries about
the cancellation.

You could add more debugging output to this loop

        rq = NULL;
        qlist_for_each_entry(rqt, &ib_device->recvq, list) {
            if (rqt->c == c && rqt->rts_mop_id == mh_rts_done.mop_id &&
                rqt->state.recv == RQ_RTS_WAITING_RTS_DONE) {
                rq = rqt;
                break;
            }
        }

to see if it knows about the mop_id but is in the wrong state.  If
you do that, be sure to drop the break so the loop visits every
entry: multiple rqt may have the same mop_id, but no more than one
should be waiting for the RTS done.
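
As a rough illustration of that extra debugging, here is a minimal
standalone sketch.  It is not the actual BMI IB code: struct recv_entry,
enum recv_state, the RQ_OTHER value and find_rts_done() are hypothetical
stand-ins, the receive queue is modeled as a plain array instead of a
qlist, and the 64-bit mop_id type is an assumption.  It scans every
entry without breaking, reports entries that know the mop_id but are in
the wrong state, and still returns the one entry waiting for RTS_DONE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the real types; only the fields that the
 * loop in the message touches are modeled here. */
enum recv_state {
    RQ_RTS_WAITING_RTS_DONE,
    RQ_RTS_WAITING_USER_TEST,
    RQ_OTHER                    /* placeholder for every other state */
};

struct recv_entry {
    const void *c;              /* connection this entry belongs to */
    uint64_t rts_mop_id;        /* assumed to be a 64-bit id */
    enum recv_state state;
};

/*
 * Walk the whole queue (no early break).  Returns the first entry on
 * connection c with this mop_id that is waiting for RTS_DONE, or NULL.
 * Entries with the same connection and mop_id in any other state are
 * logged to stderr and counted in *wrong_state.
 */
static struct recv_entry *find_rts_done(struct recv_entry *q, size_t n,
                                        const void *c, uint64_t mop_id,
                                        size_t *wrong_state)
{
    struct recv_entry *rq = NULL;
    size_t i;

    assert(wrong_state != NULL);
    *wrong_state = 0;

    for (i = 0; i < n; i++) {
        struct recv_entry *rqt = &q[i];

        if (rqt->c != c || rqt->rts_mop_id != mop_id)
            continue;

        if (rqt->state == RQ_RTS_WAITING_RTS_DONE) {
            if (rq == NULL)
                rq = rqt;       /* remember it, but keep scanning */
            else
                fprintf(stderr, "mop_id %llx: more than one entry "
                        "waiting for RTS_DONE\n",
                        (unsigned long long) mop_id);
        } else {
            fprintf(stderr, "mop_id %llx known, but in state %d\n",
                    (unsigned long long) mop_id, (int) rqt->state);
            (*wrong_state)++;
        }
    }
    return rq;
}
```

If this returns NULL with a nonzero wrong_state count, the side that
complained did know the mop_id but had already moved the entry to some
other state; a NULL with a zero count means it had no record of the
mop_id on that connection at all.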

                -- Pete
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
