Pete - I applied your patch, and got the following immediately on some I/O:
[E 02/06 14:12] Error: openib_check_cq: unknown opcode 11171.
[E 02/06 14:12] [bt] bin/sbin/pvfs2-server(error+0xca) [0x4293ba]
[E 02/06 14:12] [bt] bin/sbin/pvfs2-server [0x429f9b]
[E 02/06 14:12] [bt] bin/sbin/pvfs2-server [0x425dc3]
[E 02/06 14:12] [bt] bin/sbin/pvfs2-server [0x428129]
[E 02/06 14:12] [bt] bin/sbin/pvfs2-server(BMI_testunexpected+0x19f) [0x424d2f]
[E 02/06 14:12] [bt] bin/sbin/pvfs2-server [0x4397a7]
[E 02/06 14:12] [bt] /lib/libpthread.so.0 [0x2ba396d2df1a]
[E 02/06 14:12] [bt] /lib/libc.so.6(__clone+0x72) [0x2ba39711c602]

On Feb 6, 2008 12:43 PM, Pete Wyckoff <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote on Fri, 01 Feb 2008 19:07 -0600:
> > Troy Benjegerdes wrote:
> >>
> >>> Pete -
> >>>
> >>> I've attached a link to a log of the failure with network debugging
> >>> on in the client, single IO node. The whole log is 5.9GB so I only
> >>> attached the last 10k lines. Same error as before, of course.
> >>>
> >>> http://www.scl.ameslab.gov/~kschoche/pvfs2-client.log.gz
> >>>
> >>> The mop_ids are fairly difficult to track as they are used all over
> >>> the place and end up here and there; I can't make out anything
> >>> useful from it. :'(
> >>>
> >>> Any advice would be great,
> >>>
> >>> ~Kyle
> >>>
> >> Here is another, full logfile of another failure (93M compressed,
> >> 1GB unpacked):
> >>
> >> http://www.scl.ameslab.gov/~troy/pvfs/pvfs2-client.log.gz
> >
> > Kind of an update: after doing some tracing of function calls and
> > trying to figure out why the same mop_id was used 10,000+ times
> > during my failed run, Troy and I stumbled upon some of the FMR code.
> > After changing the id_gen_fast(mopid) calls to the id_gen_safe(mopid)
> > calls from id_generator.c, we have possibly fixed the problem;
> > however, this does introduce some amount of overhead. I'll attempt to
> > quantify the exact amount next week, but for now it at least allows
> > my tests to complete.
> >
> > Maybe something is wrong with the id_gen_fast() stuff, locking or
> > other issues maybe?
> >
> > Troy and I had some questions about how these mop_ids, which are just
> > addresses, are generated, and whether or not there is the possibility
> > for two I/O servers to generate the same address and send that to the
> > client somehow? Can you give us a brief description of the process,
> > Pete?
>
> Sorry for the delay here. You may be on to something there.
>
> The mop_id is a number assigned at each send or recv operation by the
> caller, e.g. the client. It is used internally in BMI/PVFS to track
> outstanding operations. This is true for all the transports.
>
> In IB, I thought I'd be clever and use it as the tag in RTS
> operations, so that an incoming CTS from the server would echo the
> mop_id that the client sent in its RTS, and the client can use that to
> look up which message the server is saying is clear to send.
>
> The difference between "fast" and "safe" is that fast just casts the
> pointer to an int, done. Safe uses cross-thread locking and adds one
> to a unique counter.
>
> So the reason you see lots of the same mop_ids is tied to malloc()
> returning the same block over and over. This should be perfectly fine,
> and good for keeping caches hot. There should be only one send or recv
> state structure per mop_id, guaranteeing uniqueness, but perhaps
> something is broken in here.
>
> There could be a multi-server problem, but a check on sq->c is
> supposed to be everywhere, and it looks okay from here.
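[Editor's note: to make the "fast" vs. "safe" distinction above concrete, here is a minimal standalone sketch. The function names and details are illustrative only, not the actual id_generator.c code; the point is that the fast variant derives the id from the structure's address, so a recycled malloc() block reproduces the same id immediately, while the safe variant hands out ids from a locked counter that never repeats.]

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* "fast" sketch: the id is just the state structure's address.
 * If malloc() recycles a freed block, the same id reappears at once. */
static int64_t id_gen_fast_sketch(void *ptr)
{
    return (int64_t)(intptr_t)ptr;
}

/* "safe" sketch: a mutex-protected monotonic counter, so an id is
 * never reused for the lifetime of the process. */
static int64_t id_gen_safe_sketch(void)
{
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int64_t counter = 0;
    int64_t id;

    pthread_mutex_lock(&lock);
    id = ++counter;
    pthread_mutex_unlock(&lock);
    return id;
}
```

The overhead Kyle mentions is visible here: the fast path is a pointer cast, while the safe path pays a lock round trip on every send or recv post.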
>
> The sender goes through these phases for a message that needs RTS:
>
>     post_send()
>         assign mop_id
>     SQ_WAITING_BUFFER
>     encourage_send_waiting_buffer()
>         send RTS with mop_id
>     SQ_WAITING_RTS_SEND_COMPLETION
>     ib_check_cq()
>         get mop_id back
>     SQ_WAITING_CTS
>     encourage_send_incoming_cts()
>         get mop_id, look up send state structure (sq)
>         send data
>     SQ_WAITING_DATA_SEND_COMPLETION
>     ib_check_cq()
>     SQ_WAITING_RTS_DONE_BUFFER
>     encourage_rts_done_waiting_buffer()
>         send done message with mop_id
>     SQ_WAITING_RTS_DONE_SEND_COMPLETION
>     ib_check_cq()
>     SQ_WAITING_USER_TEST
>     test() et al.
>         free mop_id
>
> The receiver goes like this. Case preposted:
>
>     post_recv()
>         assign mop_id
>     RQ_WAITING_INCOMING
>     ib_check_cq()
>         (his rts arrives)
>         assign rq->rts_mop_id with sender's mop_id
>     RQ_RTS_WAITING_CTS_BUFFER
>     goto common path
>
> Case non-preposted:
>
>     ib_check_cq()
>         (his rts arrives)
>         alloc new recv state rq
>         assign rq->rts_mop_id with sender's mop_id
>     RQ_RTS_WAITING_USER_POST
>     post_recv()
>         assign our mop_id to existing rq
>     goto common path
>
> Common recv path:
>
>     send_cts()
>         send cts with sender's mop_id
>     RQ_RTS_WAITING_CTS_SEND_COMPLETION
>     ib_check_cq()
>         (cts send completed)
>     RQ_RTS_WAITING_RTS_DONE
>     ib_check_cq()
>         (his done message arrived)
>         compare sender's mop_id against his stored mop_id
>     RQ_RTS_WAITING_USER_TEST
>     test() et al.
>         free mop_id
>
> The bug can be seen from your trace. The client posts 4 recvs: 8k,
> 1200k, 8k, 8k. The sender sends something to match recv 1 with mopid
> 1, then recv 3 with mopid 3, then recv 4 with mopid 4. But mopid 1 ==
> mopid 4. This is valid, as the server could have done a "test" and
> retired the mopid, only to use it again for send 4. But to the client,
> this mopid is still on the "active recv" list twice.
>
> Your trace, by the way, has lots of 8192-byte sends. These end up
> going RTS rather than eager, as the limit is like 8170.
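[Editor's note: a minimal sketch of the lookup hazard Pete describes, using illustrative stand-in structures rather than the real BMI/IB code (which walks ib_device->recvq with qlist_for_each_entry). When a retired mop_id is reused while a stale rq carrying the same id is still queued, a lookup keyed on the id alone can return the wrong entry; additionally requiring the one state a DONE message can legally target, as the patch below does, disambiguates.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the recv state structures. */
enum rq_state {
    RQ_RTS_WAITING_RTS_DONE,
    RQ_RTS_WAITING_USER_TEST,
};

struct rq {
    uint64_t rts_mop_id;
    enum rq_state state;
};

/* Match on id alone: a stale, already-tested rq that happens to carry
 * the same (recycled) mop_id can shadow the live one. */
static struct rq *find_by_id(struct rq *q, size_t n, uint64_t id)
{
    for (size_t i = 0; i < n; i++)
        if (q[i].rts_mop_id == id)
            return &q[i];
    return NULL;
}

/* Also require the state a RTS_DONE message can legally target. */
static struct rq *find_by_id_and_state(struct rq *q, size_t n, uint64_t id)
{
    for (size_t i = 0; i < n; i++)
        if (q[i].rts_mop_id == id &&
            q[i].state == RQ_RTS_WAITING_RTS_DONE)
            return &q[i];
    return NULL;
}
```

With two queued entries sharing mop_id 1, where only the second is actually waiting for a DONE, the id-only lookup picks the stale first entry and the id-plus-state lookup picks the live one.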
> You might want to bump up DEFAULT_EAGER_BUF_SIZE in ib.h to get better
> performance.
>
> Here's a patch (untested), with a bit more debugging. Let me know if
> it works and I'll check it in. Thanks for tracking this down.
>
>         -- Pete
>
> Index: src/io/bmi/bmi_ib/ib.c
> ===================================================================
> RCS file: /projects/cvsroot/pvfs2/src/io/bmi/bmi_ib/ib.c,v
> retrieving revision 1.62
> diff -u -p -r1.62 ib.c
> --- src/io/bmi/bmi_ib/ib.c      5 Dec 2007 20:03:17 -0000      1.62
> +++ src/io/bmi/bmi_ib/ib.c      6 Feb 2008 18:40:09 -0000
> @@ -217,8 +217,9 @@ static int ib_check_cq(void)
>              assert(0, "%s: unknown send state %s (%d) of sq %p",
>                     __func__, sq_state_name(sq->state.send),
>                     sq->state.send, sq);
> -        debug(2, "%s: send to %s completed locally: -> %s",
> -              __func__, bh->c->peername, sq_state_name(sq->state.send));
> +        debug(2, "%s: send to %s completed locally: sq %p -> %s",
> +              __func__, bh->c->peername, sq,
> +              sq_state_name(sq->state.send));
>
>      } else {
>          struct ib_work *rq = sq;  /* rename */
> @@ -229,8 +230,9 @@ static int ib_check_cq(void)
>          else
>              assert(0, "%s: unknown send state %s of rq %p",
>                     __func__, rq_state_name(rq->state.recv), rq);
> -        debug(2, "%s: send to %s completed locally: -> %s",
> -              __func__, bh->c->peername, rq_state_name(rq->state.recv));
> +        debug(2, "%s: send to %s completed locally: rq %p -> %s",
> +              __func__, bh->c->peername, rq,
> +              rq_state_name(rq->state.recv));
>      }
>
>      qlist_add_tail(&bh->list, &bh->c->eager_send_buf_free);
> @@ -599,7 +601,8 @@ encourage_recv_incoming(struct buf_head
>
>      rq = NULL;
>      qlist_for_each_entry(rqt, &ib_device->recvq, list) {
> -        if (rqt->c == c && rqt->rts_mop_id == mh_rts_done.mop_id) {
> +        if (rqt->c == c && rqt->rts_mop_id == mh_rts_done.mop_id &&
> +            rqt->state.recv == RQ_RTS_WAITING_RTS_DONE) {
>              rq = rqt;
>              break;
>          }
> @@ -607,9 +610,6 @@ encourage_recv_incoming(struct buf_head
>
>      assert(rq, "%s: mop_id %llx in RTS_DONE message not found",
>             __func__, llu(mh_rts_done.mop_id));
> -    assert(rq->state.recv == RQ_RTS_WAITING_RTS_DONE,
> -           "%s: RTS_DONE to rq wrong state %s",
> -           __func__, rq_state_name(rq->state.recv));
>
> #if MEMCACHE_BOUNCEBUF
>      memcpy_to_buflist(&rq->buflist, reg_recv_buflist_buf,
>
> _______________________________________________
> Pvfs2-developers mailing list
> [email protected]
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

--
Kyle Schochenmaier
