[EMAIL PROTECTED] wrote on Fri, 22 Feb 2008 14:11 -0600:
> We just had this occur..
>
> Is this really a valid assert? Are there any other valid states that can
> cause a transition to RQ_RTS_WAITING_USER_TEST besides
> RQ_RTS_WAITING_RTS_DONE?
>
> [D 02/22 13:44] PVFS2 Server version 2.7.1pre1-2008-02-19-171553 starting.
> [E 02/22 13:44] max send/recv sge 14 15
> [E 02/22 13:52] Error: encourage_recv_incoming: mop_id 10164ae0 in RTS_DONE
> message not found.

The other side did an RDMA, then sent a message saying "the rdma is
done" and referencing the given mop_id. This side is complaining
it doesn't know about any such mop_id. There's really not much else
to do here but die. How did this side forget about the mop_id? Did
the other side send a duplicate done message? Any of these things
would be bugs.

Perhaps a cancelled message on the receiver might lead to some sort
of breakage here. You probably would have logs talking about that.

You could add more debug to this loop

    rq = NULL;
    qlist_for_each_entry(rqt, &ib_device->recvq, list) {
        if (rqt->c == c && rqt->rts_mop_id == mh_rts_done.mop_id &&
            rqt->state.recv == RQ_RTS_WAITING_RTS_DONE) {
            rq = rqt;
            break;
        }
    }

to see if it knows about the mop_id but is in the wrong state. If you
do, be sure not to break out of the loop on the first mop_id match, as
multiple rqt may have the same mop_id, but no more than one should be
waiting for the rts done.
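
Something along these lines is roughly what I mean. This is only a
standalone sketch, not the actual server code: struct recv_entry,
struct scan_result, scan_for_rts_done and RQ_OTHER_STATE are made up
for illustration, the recv_state field stands in for rqt->state.recv,
and a plain array stands in for the qlist so that it compiles on its
own. It looks at every entry with a matching connection and mop_id,
logs the state of each, and counts how many are waiting for the rts
done instead of stopping at the first one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the real receive-queue types. */
enum recv_state {
    RQ_RTS_WAITING_RTS_DONE,
    RQ_RTS_WAITING_USER_TEST,
    RQ_OTHER_STATE              /* placeholder for every other state */
};

struct recv_entry {
    const void *c;              /* owning connection */
    uint64_t rts_mop_id;        /* mop_id announced in the RTS */
    enum recv_state recv_state; /* stands in for rqt->state.recv */
};

struct scan_result {
    struct recv_entry *rq;      /* first entry waiting for RTS_DONE, or NULL */
    size_t id_matches;          /* entries with matching c and mop_id */
    size_t waiting;             /* of those, how many wait for RTS_DONE */
};

/* Scan the whole queue; an array replaces the qlist in this sketch. */
struct scan_result scan_for_rts_done(struct recv_entry *recvq, size_t n,
                                     const void *c, uint64_t mop_id)
{
    struct scan_result res = { NULL, 0, 0 };
    size_t i;

    assert(recvq != NULL || n == 0);
    for (i = 0; i < n; i++) {
        struct recv_entry *rqt = &recvq[i];

        if (rqt->c != c || rqt->rts_mop_id != mop_id)
            continue;

        /* Known mop_id: report its state even if it is the wrong one. */
        res.id_matches++;
        fprintf(stderr, "mop_id %llx: entry %zu in state %d\n",
                (unsigned long long) mop_id, i, (int) rqt->recv_state);

        if (rqt->recv_state == RQ_RTS_WAITING_RTS_DONE) {
            res.waiting++;
            if (res.rq == NULL)
                res.rq = rqt;
        }
        /* Deliberately no break: keep going to see every match. */
    }
    return res;
}
```

If id_matches is nonzero but rq is NULL, the mop_id is known but in
the wrong state. If waiting is ever more than one, that would be a
different bug.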
-- Pete