Pete - I applied your patch, and got the following immediately on some I/O:
[E 02/06 14:12] Error: openib_check_cq: unknown opcode 11171.
[E 02/06 14:12] [bt] bin/sbin/pvfs2-server(error+0xca) [0x4293ba]
[E 02/06 14:12] [bt] bin/sbin/pvfs2-server [0x429f9b]
[E 02/06 14:12] [bt] bin/sbin/pvfs2-server [0x425dc3]
[E 02/06 14:12] [bt] bin/sbin/pvfs2-server [0x428129]
[E 02/06 14:12] [bt] bin/sbin/pvfs2-server(BMI_testunexpected+0x19f) [0x424d2f]
[E 02/06 14:12] [bt] bin/sbin/pvfs2-server [0x4397a7]
[E 02/06 14:12] [bt] /lib/libpthread.so.0 [0x2ba396d2df1a]
[E 02/06 14:12] [bt] /lib/libc.so.6(__clone+0x72) [0x2ba39711c602]

On Feb 6, 2008 12:43 PM, Pete Wyckoff <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote on Fri, 01 Feb 2008 19:07 -0600:
> > Troy Benjegerdes wrote:
> >>
> >>> Pete -
> >>>
> >>> I've attached a link to a log of the failure with network debugging
> >>> on in the client, single IO node. The whole log is 5.9GB so I only
> >>> attached the last 10k lines. Same error as before, of course.
> >>>
> >>> http://www.scl.ameslab.gov/~kschoche/pvfs2-client.log.gz
> >>>
> >>> The mop_ids are fairly difficult to track as they are used all over
> >>> the place and end up here and there; I can't make out anything
> >>> useful from it. :'(
> >>>
> >>> Any advice would be great,
> >>>
> >>> ~Kyle
> >>>
> >> Here is another, full logfile of another failure (93M compressed,
> >> 1GB unpacked):
> >>
> >> http://www.scl.ameslab.gov/~troy/pvfs/pvfs2-client.log.gz
> >
> > Kind of an update: after doing some tracing of function calls and
> > trying to figure out why the same mop_id was used 10,000+ times
> > during my failed run, Troy and I stumbled upon some of the FMR code.
> > After changing the id_gen_fast(mopid) calls to the id_gen_safe(mopid)
> > calls from id_generator.c, we have possibly fixed the problem;
> > however, this does introduce some amount of overhead. I'll attempt to
> > quantify the exact amount next week, but for now it at least allows
> > my tests to complete.
> >
> > Maybe something is wrong with the id_gen_fast() stuff, locking or
> > other issues maybe?
> >
> > Troy and I had some questions about how these mop_ids, which are just
> > addresses, are generated, and whether or not there is the possibility
> > for two I/O servers to generate the same address and send that to the
> > client somehow? Can you give us a brief description of the process,
> > Pete?
>
> Sorry for the delay here. You may be on to something there.
>
> The mop_id is a number assigned at each send or recv operation by the
> caller, e.g. the client. It is used internally in BMI/PVFS to track
> outstanding operations. This is true for all the transports.
>
> In IB, I thought I'd be clever and use it as the tag in RTS
> operations, so that an incoming CTS from the server would echo the
> mop_id that the client sent in its RTS, and the client can use that to
> look up which message the server is saying is clear to send.
>
> The difference between "fast" and "safe" is that fast just casts the
> pointer to an int, done. Safe uses cross-thread locking and adds one
> to a unique counter.
>
> So the reason you see lots of the same mop_ids is tied to malloc()
> returning the same block over and over. This should be perfectly fine,
> and good for keeping caches hot. There should be only one send or recv
> state structure per mop_id, guaranteeing uniqueness, but perhaps
> something is broken in here.
>
> There could be a multi-server problem, but a check on sq->c is
> supposed to be everywhere, and it looks okay from here.
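[Editor's note: to make the "fast" vs. "safe" distinction above concrete, here is a minimal standalone sketch. The function names and details are illustrative only, not the actual id_generator.c code; the point is that the fast variant derives the id from the structure's address, so a recycled malloc() block reproduces the same id immediately, while the safe variant hands out ids from a locked counter that never repeats.]

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* "fast" sketch: the id is just the state structure's address.
 * If malloc() recycles a freed block, the same id reappears at once. */
static int64_t id_gen_fast_sketch(void *ptr)
{
    return (int64_t)(intptr_t)ptr;
}

/* "safe" sketch: a mutex-protected monotonic counter, so an id is
 * never reused for the lifetime of the process. */
static int64_t id_gen_safe_sketch(void)
{
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int64_t counter = 0;
    int64_t id;

    pthread_mutex_lock(&lock);
    id = ++counter;
    pthread_mutex_unlock(&lock);
    return id;
}
```

The overhead Kyle mentions is visible here: the fast path is a pointer cast, while the safe path pays a lock round trip on every send or recv post.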
>
> The sender goes through these phases for a message that needs RTS:
>
>     post_send()
>         assign mop_id
>     SQ_WAITING_BUFFER
>     encourage_send_waiting_buffer()
>         send RTS with mop_id
>     SQ_WAITING_RTS_SEND_COMPLETION
>     ib_check_cq()
>         get mop_id back
>     SQ_WAITING_CTS
>     encourage_send_incoming_cts()
>         get mop_id, look up send state structure (sq)
>         send data
>     SQ_WAITING_DATA_SEND_COMPLETION
>     ib_check_cq()
>     SQ_WAITING_RTS_DONE_BUFFER
>     encourage_rts_done_waiting_buffer()
>         send done message with mop_id
>     SQ_WAITING_RTS_DONE_SEND_COMPLETION
>     ib_check_cq()
>     SQ_WAITING_USER_TEST
>     test() et al.
>         free mop_id
>
> The receiver goes like this. Case preposted:
>
>     post_recv()
>         assign mop_id
>     RQ_WAITING_INCOMING
>     ib_check_cq()
>         (his rts arrives)
>         assign rq->rts_mop_id with sender's mop_id
>     RQ_RTS_WAITING_CTS_BUFFER
>     goto common path
>
> Case non-preposted:
>
>     ib_check_cq()
>         (his rts arrives)
>         alloc new recv state rq
>         assign rq->rts_mop_id with sender's mop_id
>     RQ_RTS_WAITING_USER_POST
>     post_recv()
>         assign our mop_id to existing rq
>     goto common path
>
> Common recv path:
>
>     send_cts()
>         send cts with sender's mop_id
>     RQ_RTS_WAITING_CTS_SEND_COMPLETION
>     ib_check_cq()
>         (cts send completed)
>     RQ_RTS_WAITING_RTS_DONE
>     ib_check_cq()
>         (his done message arrived)
>         compare sender's mop_id against his stored mop_id
>     RQ_RTS_WAITING_USER_TEST
>     test() et al.
>         free mop_id
>
> The bug can be seen from your trace. The client posts 4 recvs: 8k,
> 1200k, 8k, 8k. The sender sends something to match recv 1 with mopid
> 1, then recv 3 with mopid 3, then recv 4 with mopid 4. But mopid 1 ==
> mopid 4. This is valid, as the server could have done a "test" and
> retired the mopid, only to use it again for send 4. But to the client,
> this mopid is still on the "active recv" list twice.
>
> Your trace, by the way, has lots of 8192-byte sends. These end up
> going RTS rather than eager, as the limit is like 8170.
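[Editor's note: a minimal sketch of the lookup hazard Pete describes, using illustrative stand-in structures rather than the real BMI/IB code (which walks ib_device->recvq with qlist_for_each_entry). When a retired mop_id is reused while a stale rq carrying the same id is still queued, a lookup keyed on the id alone can return the wrong entry; additionally requiring the one state a DONE message can legally target, as the patch below does, disambiguates.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the recv state structures. */
enum rq_state {
    RQ_RTS_WAITING_RTS_DONE,
    RQ_RTS_WAITING_USER_TEST,
};

struct rq {
    uint64_t rts_mop_id;
    enum rq_state state;
};

/* Match on id alone: a stale, already-tested rq that happens to carry
 * the same (recycled) mop_id can shadow the live one. */
static struct rq *find_by_id(struct rq *q, size_t n, uint64_t id)
{
    for (size_t i = 0; i < n; i++)
        if (q[i].rts_mop_id == id)
            return &q[i];
    return NULL;
}

/* Also require the state a RTS_DONE message can legally target. */
static struct rq *find_by_id_and_state(struct rq *q, size_t n, uint64_t id)
{
    for (size_t i = 0; i < n; i++)
        if (q[i].rts_mop_id == id &&
            q[i].state == RQ_RTS_WAITING_RTS_DONE)
            return &q[i];
    return NULL;
}
```

With two queued entries sharing mop_id 1, where only the second is actually waiting for a DONE, the id-only lookup picks the stale first entry and the id-plus-state lookup picks the live one.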
> You might want to bump up DEFAULT_EAGER_BUF_SIZE in ib.h to get better
> performance.
>
> Here's a patch (untested), with a bit more debugging. Let me know if
> it works and I'll check it in. Thanks for tracking this down.
>
>         -- Pete
>
> Index: src/io/bmi/bmi_ib/ib.c
> ===================================================================
> RCS file: /projects/cvsroot/pvfs2/src/io/bmi/bmi_ib/ib.c,v
> retrieving revision 1.62
> diff -u -p -r1.62 ib.c
> --- src/io/bmi/bmi_ib/ib.c      5 Dec 2007 20:03:17 -0000      1.62
> +++ src/io/bmi/bmi_ib/ib.c      6 Feb 2008 18:40:09 -0000
> @@ -217,8 +217,9 @@ static int ib_check_cq(void)
>              assert(0, "%s: unknown send state %s (%d) of sq %p",
>                     __func__, sq_state_name(sq->state.send),
>                     sq->state.send, sq);
> -        debug(2, "%s: send to %s completed locally: -> %s",
> -              __func__, bh->c->peername, sq_state_name(sq->state.send));
> +        debug(2, "%s: send to %s completed locally: sq %p -> %s",
> +              __func__, bh->c->peername, sq,
> +              sq_state_name(sq->state.send));
>
>      } else {
>          struct ib_work *rq = sq;  /* rename */
> @@ -229,8 +230,9 @@ static int ib_check_cq(void)
>          else
>              assert(0, "%s: unknown send state %s of rq %p",
>                     __func__, rq_state_name(rq->state.recv), rq);
> -        debug(2, "%s: send to %s completed locally: -> %s",
> -              __func__, bh->c->peername, rq_state_name(rq->state.recv));
> +        debug(2, "%s: send to %s completed locally: rq %p -> %s",
> +              __func__, bh->c->peername, rq,
> +              rq_state_name(rq->state.recv));
>      }
>
>      qlist_add_tail(&bh->list, &bh->c->eager_send_buf_free);
> @@ -599,7 +601,8 @@ encourage_recv_incoming(struct buf_head
>
>      rq = NULL;
>      qlist_for_each_entry(rqt, &ib_device->recvq, list) {
> -        if (rqt->c == c && rqt->rts_mop_id == mh_rts_done.mop_id) {
> +        if (rqt->c == c && rqt->rts_mop_id == mh_rts_done.mop_id &&
> +            rqt->state.recv == RQ_RTS_WAITING_RTS_DONE) {
>              rq = rqt;
>              break;
>          }
> @@ -607,9 +610,6 @@ encourage_recv_incoming(struct buf_head
>
>      assert(rq, "%s: mop_id %llx in RTS_DONE message not found",
>             __func__, llu(mh_rts_done.mop_id));
> -    assert(rq->state.recv == RQ_RTS_WAITING_RTS_DONE,
> -           "%s: RTS_DONE to rq wrong state %s",
> -           __func__, rq_state_name(rq->state.recv));
>
> #if MEMCACHE_BOUNCEBUF
>      memcpy_to_buflist(&rq->buflist, reg_recv_buflist_buf,
>
> _______________________________________________
> Pvfs2-developers mailing list
> [email protected]
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

--
Kyle Schochenmaier
