[EMAIL PROTECTED] wrote on Fri, 01 Feb 2008 19:07 -0600:
> Troy Benjegerdes wrote:
>>
>>> Pete -
>>>
>>> I've attached a link to a log of the failure with network debugging on in
>>> the client, single IO node.  The whole log is 5.9GB so I only attached 
>>> the
>>> last 10k lines.  Same error as before of course.
>>>
>>> http://www.scl.ameslab.gov/~kschoche/pvfs2-client.log.gz
>>>
>>> The mop_ids are fairly difficult to track, as they are used all over
>>> the place and end up here and there; I can't make out anything useful
>>> from it :'(
>>>
>>> Any advice would be great,
>>>
>>> ~Kyle
>>>
>>>   
>> Here is another, full logfile of another failure: (93M compressed, 1GB 
>> unpacked)
>>
>> http://www.scl.ameslab.gov/~troy/pvfs/pvfs2-client.log.gz
>
> Kind of an update: after tracing function calls and trying to figure out
> why the same "mop_id" was used 10,000+ times during my failed run, Troy
> and I stumbled upon some of the fmr code.  After changing the
> id_gen_fast(mopid) calls in id_generator.c to use id_gen_safe(mopid), we
> have possibly fixed the problem; however, this does introduce some
> overhead.  I'll run tests next week to quantify exactly how much, but
> for now it at least allows my tests to complete.
>
> Maybe something is wrong with the id_gen_fast() stuff, perhaps locking
> or other issues?
>
> Troy and I had some questions about how these mop_ids, which are just
> addresses, are generated, and whether there is any possibility for two
> I/O servers to generate the same address and somehow send it to the
> client.
> Can you give us a brief description of the process, Pete?

Sorry for the delay here.  You may be on to something there.

The mop_id is a number assigned at each send or recv operation by
the caller, e.g. the client.  It is used internally in BMI/PVFS to
track outstanding operations.  This is true for all the transports.

In IB, I thought I'd be clever and use it as the tag in RTS
operations, so that an incoming CTS from the server would echo the
mop_id that the client sent in its RTS, and the client can use that
to look up which message the server is saying is clear to send.

The difference between "fast" and "safe" is that fast just casts the
pointer to an int and is done.  Safe takes a cross-thread lock and
increments a unique counter.
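As a rough illustration of the two strategies (a minimal sketch, not
the actual id_generator.c code; the `_sketch` names are hypothetical):

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* "fast": the id is just the pointer value of the operation structure.
 * Unique only while that structure is alive; once it is freed, malloc()
 * may hand the same address out again, so ids repeat across operations. */
static uint64_t id_gen_fast_sketch(void *op)
{
    return (uint64_t)(uintptr_t)op;
}

/* "safe": a monotonically increasing counter behind a lock, so every id
 * handed out is distinct for the life of the process. */
static pthread_mutex_t id_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t id_counter;

static uint64_t id_gen_safe_sketch(void *op)
{
    uint64_t id;
    (void)op;  /* the pointer is not used; only the counter matters */
    pthread_mutex_lock(&id_lock);
    id = ++id_counter;
    pthread_mutex_unlock(&id_lock);
    return id;
}
```

The fast variant is lock-free and cheap, which is why the reuse you saw
is expected behavior rather than a bug in itself.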

So the reason you see lots of the same mop_ids is tied to malloc()
returning the same block over and over.  This should be perfectly
fine, and good to keep caches hot.  There should be only one send or
recv state structure per mop_id, guaranteeing uniqueness, but
perhaps something is broken in here.

There could be a multi-server problem, but a check on sq->c is
supposed to be everywhere, and looks okay from here.

The sender goes through these phases for a message that needs RTS:

    post_send()
        assign mop_id
        SQ_WAITING_BUFFER
    encourage_send_waiting_buffer()
        send RTS with mop_id
        SQ_WAITING_RTS_SEND_COMPLETION
    ib_check_cq()
        get mop_id back
        SQ_WAITING_CTS
    encourage_send_incoming_cts()
        get mop_id, lookup send state structure (sq)
        send data
        SQ_WAITING_DATA_SEND_COMPLETION
    ib_check_cq()
        SQ_WAITING_RTS_DONE_BUFFER
    encourage_rts_done_waiting_buffer()
        send done message with mop_id
        SQ_WAITING_RTS_DONE_SEND_COMPLETION
    ib_check_cq()
        SQ_WAITING_USER_TEST
    test() et al
        free mop_id
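The progression above can be summarized as a small table (a sketch
using the state and function names from this mail, not the real ib.c
definitions):

```c
#include <assert.h>
#include <string.h>

/* Send-side states for an RTS message, in the order listed above. */
enum sq_state {
    SQ_WAITING_BUFFER,
    SQ_WAITING_RTS_SEND_COMPLETION,
    SQ_WAITING_CTS,
    SQ_WAITING_DATA_SEND_COMPLETION,
    SQ_WAITING_RTS_DONE_BUFFER,
    SQ_WAITING_RTS_DONE_SEND_COMPLETION,
    SQ_WAITING_USER_TEST,
};

/* Which routine advances the sq out of each state, per the phases above. */
static const char *sq_advanced_by(enum sq_state s)
{
    switch (s) {
    case SQ_WAITING_BUFFER:                   return "encourage_send_waiting_buffer";
    case SQ_WAITING_RTS_SEND_COMPLETION:      return "ib_check_cq";
    case SQ_WAITING_CTS:                      return "encourage_send_incoming_cts";
    case SQ_WAITING_DATA_SEND_COMPLETION:     return "ib_check_cq";
    case SQ_WAITING_RTS_DONE_BUFFER:          return "encourage_rts_done_waiting_buffer";
    case SQ_WAITING_RTS_DONE_SEND_COMPLETION: return "ib_check_cq";
    case SQ_WAITING_USER_TEST:                return "test";
    }
    return "?";
}
```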

The receiver goes like this.  Case preposted:

    post_recv()
        assign mop_id
        RQ_WAITING_INCOMING
    ib_check_cq()
        (his rts arrives)
        assign rq->rts_mop_id with sender's mop_id
        RQ_RTS_WAITING_CTS_BUFFER
        goto common path

Case non-preposted:

    ib_check_cq()
        (his rts arrives)
        alloc new recv state rq
        assign rq->rts_mop_id with sender's mop_id
        RQ_RTS_WAITING_USER_POST
    post_recv()
        assign our mop_id to existing rq
        goto common path

Common recv path:

    send_cts()
        send cts with sender's mop_id
        RQ_RTS_WAITING_CTS_SEND_COMPLETION
    ib_check_cq()
        (cts send completed)
        RQ_RTS_WAITING_RTS_DONE
    ib_check_cq()
        (his done message arrived)
        compare sender's mop_id against his stored mop_id
        RQ_RTS_WAITING_USER_TEST
    test() et al
        free mop_id

The bug can be seen from your trace.  The client posts 4 recvs:  8k,
1200k, 8k, 8k.  The sender sends something to match recv 1 with mopid
1, then recv 3 with mopid 3, then recv 4 with mopid 4.  But mopid 1 ==
mopid 4.  This is valid, as the server could have done a "test" and
retired the mopid, only to use it again for send 4.  But to the
client, that mopid is now on the "active recv" list twice.
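So the client's scan of the active recv list by the sender's mop_id
can hit the stale duplicate first.  A simplified sketch of the buggy
versus patched lookup (hypothetical struct and array layout; the real
code walks ib_device->recvq in encourage_recv_incoming()):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified recv-queue entry; field names follow this mail, not the
 * real struct ib_work layout. */
enum rq_state {
    RQ_WAITING_INCOMING,
    RQ_RTS_WAITING_RTS_DONE,
    RQ_RTS_WAITING_USER_TEST,
};

struct rq {
    uint64_t rts_mop_id;   /* sender's mop_id, echoed from his RTS */
    enum rq_state state;
};

/* Buggy lookup: if the server retired a mop_id and reused it for a
 * later send, two active rqs carry the same rts_mop_id and the first
 * match may be the wrong one. */
static struct rq *find_rq_buggy(struct rq *rqs, size_t n, uint64_t mop_id)
{
    for (size_t i = 0; i < n; i++)
        if (rqs[i].rts_mop_id == mop_id)
            return &rqs[i];
    return NULL;
}

/* Fixed lookup, as in the patch below: also require the rq to be
 * waiting for the RTS_DONE message, skipping the stale duplicate. */
static struct rq *find_rq_fixed(struct rq *rqs, size_t n, uint64_t mop_id)
{
    for (size_t i = 0; i < n; i++)
        if (rqs[i].rts_mop_id == mop_id &&
            rqs[i].state == RQ_RTS_WAITING_RTS_DONE)
            return &rqs[i];
    return NULL;
}
```

The patch turns the old post-lookup state assertion into part of the
match condition, which is exactly the difference between the two
functions above.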

Your trace, by the way, has lots of 8192-byte sends.  These end up
going RTS rather than eager, as the eager limit is around 8170 bytes.
You might want to bump up DEFAULT_EAGER_BUF_SIZE in ib.h to get better
performance.
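A sketch of that cutoff (the real check lives in ib.c; the header
overhead here is only an assumed placeholder chosen so the effective
limit lands near the ~8170 figure above):

```c
#include <assert.h>
#include <stddef.h>

/* An eager send must fit, headers included, in one eager buffer;
 * otherwise the message falls back to the RTS/CTS rendezvous path.
 * Both sizes are parameters here because the real constants are
 * configuration-dependent. */
static int goes_eager(size_t msg_len, size_t eager_buf_size,
                      size_t hdr_overhead)
{
    return msg_len + hdr_overhead <= eager_buf_size;
}
```

With an 8 KiB eager buffer, an 8192-byte payload just misses the
cutoff, which matches the RTS traffic in the trace; raising the buffer
size moves those sends back onto the eager path.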

Here's a patch (untested), with a bit more debugging.  Let me know
if it works and I'll check it in.  Thanks for tracking this down.

                -- Pete

Index: src/io/bmi/bmi_ib/ib.c
===================================================================
RCS file: /projects/cvsroot/pvfs2/src/io/bmi/bmi_ib/ib.c,v
retrieving revision 1.62
diff -u -p -r1.62 ib.c
--- src/io/bmi/bmi_ib/ib.c      5 Dec 2007 20:03:17 -0000       1.62
+++ src/io/bmi/bmi_ib/ib.c      6 Feb 2008 18:40:09 -0000
@@ -217,8 +217,9 @@ static int ib_check_cq(void)
                    assert(0, "%s: unknown send state %s (%d) of sq %p",
                           __func__, sq_state_name(sq->state.send),
                           sq->state.send, sq);
-               debug(2, "%s: send to %s completed locally: -> %s",
-                     __func__, bh->c->peername, sq_state_name(sq->state.send));
+               debug(2, "%s: send to %s completed locally: sq %p -> %s",
+                     __func__, bh->c->peername, sq,
+                     sq_state_name(sq->state.send));
 
            } else {
                struct ib_work *rq = sq;  /* rename */
@@ -229,8 +230,9 @@ static int ib_check_cq(void)
                else
                    assert(0, "%s: unknown send state %s of rq %p",
                           __func__, rq_state_name(rq->state.recv), rq);
-               debug(2, "%s: send to %s completed locally: -> %s",
-                     __func__, bh->c->peername, rq_state_name(rq->state.recv));
+               debug(2, "%s: send to %s completed locally: rq %p -> %s",
+                     __func__, bh->c->peername, rq,
+                     rq_state_name(rq->state.recv));
            }
 
            qlist_add_tail(&bh->list, &bh->c->eager_send_buf_free);
@@ -599,7 +601,8 @@ encourage_recv_incoming(struct buf_head 
 
        rq = NULL;
        qlist_for_each_entry(rqt, &ib_device->recvq, list) {
-           if (rqt->c == c && rqt->rts_mop_id == mh_rts_done.mop_id) {
+           if (rqt->c == c && rqt->rts_mop_id == mh_rts_done.mop_id &&
+               rqt->state.recv == RQ_RTS_WAITING_RTS_DONE) {
                rq = rqt;
                break;
            }
@@ -607,9 +610,6 @@ encourage_recv_incoming(struct buf_head 
 
        assert(rq, "%s: mop_id %llx in RTS_DONE message not found",
               __func__, llu(mh_rts_done.mop_id));
-       assert(rq->state.recv == RQ_RTS_WAITING_RTS_DONE,
-              "%s: RTS_DONE to rq wrong state %s",
-              __func__, rq_state_name(rq->state.recv));
 
 #if MEMCACHE_BOUNCEBUF
        memcpy_to_buflist(&rq->buflist, reg_recv_buflist_buf,
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
