[EMAIL PROTECTED] wrote on Fri, 01 Feb 2008 19:07 -0600:
> Troy Benjegerdes wrote:
>>
>>> Pete -
>>>
>>> I've attached a link to a log of the failure with network debugging on in
>>> the client, single IO node. The whole log is 5.9GB so I only attached
>>> the
>>> last 10k lines. Same error as before of course.
>>>
>>> http://www.scl.ameslab.gov/~kschoche/pvfs2-client.log.gz
>>>
>>> The mop_ids are fairly difficult to track, as they are used all over
>>> the place and end up here and there; I can't make out anything useful
>>> from it :'(
>>>
>>> Any advice would be great,
>>>
>>> ~Kyle
>>>
>>>
>> Here is another, full logfile of another failure: (93M compressed, 1GB
>> unpacked)
>>
>> http://www.scl.ameslab.gov/~troy/pvfs/pvfs2-client.log.gz
>
> Kind of an update: after doing some tracing of function calls and trying
> to figure out why the same "mop_id" was used 10,000+ times during my
> failed run, Troy and I stumbled upon some of the fmr code. After changing
> the id_gen_fast(mopid) calls to id_gen_safe(mopid) in id_generator.c, we
> have possibly fixed the problem. However, this does introduce some amount
> of overhead; I'll attempt to quantify the exact amount next week, but for
> now it seems to at least allow my tests to complete.
>
> Maybe something is wrong with the id_gen_fast() stuff? Locking or other
> issues, perhaps?
>
> Troy and I had some questions about how these mop_ids, which are just
> addresses, are generated, and whether there is the possibility for two
> I/O servers to generate the same address and somehow send it to the
> client. Can you give us a brief description of the process, Pete?
Sorry for the delay here. You may be on to something there.
The mop_id is a number assigned at each send or recv operation by
the caller, e.g. the client. It is used internally in BMI/PVFS to
track outstanding operations. This is true for all the transports.
In IB, I thought I'd be clever and use it as the tag in RTS
operations, so that an incoming CTS from the server would echo the
mop_id that the client sent in its RTS, and the client can use that
to look up which message the server is saying is clear to send.
The difference between "fast" and "safe" is that fast just casts the
pointer to an int and is done, while safe takes a cross-thread lock
and increments a unique counter.
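The difference is easy to see in a sketch. This is an illustration only; the
names follow id_generator.c, but the exact prototypes there are assumptions:

```c
#include <stdint.h>
#include <pthread.h>

/* "fast": the id is just the operation pointer reinterpreted as an
 * integer, so a recycled memory block yields a recycled id. */
static uint64_t id_gen_fast(void *op)
{
    return (uint64_t) (uintptr_t) op;
}

/* "safe": a process-wide counter behind a lock, so an id is never
 * reused even if the underlying memory block is. */
static pthread_mutex_t id_mutex = PTHREAD_MUTEX_INITIALIZER;
static uint64_t id_counter = 0;

static uint64_t id_gen_safe(void)
{
    uint64_t id;
    pthread_mutex_lock(&id_mutex);
    id = ++id_counter;
    pthread_mutex_unlock(&id_mutex);
    return id;
}
```

The lock and counter are the overhead Kyle measured; the pointer cast is free.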
So the reason you see lots of the same mop_ids is tied to malloc()
returning the same block over and over. This should be perfectly
fine, and good to keep caches hot. There should be only one send or
recv state structure per mop_id, guaranteeing uniqueness, but
perhaps something is broken in here.
There could be a multi-server problem, but a check on sq->c is
supposed to be everywhere, and looks okay from here.
The sender goes through these phases for a message that needs RTS:
    post_send()
        assign mop_id
        SQ_WAITING_BUFFER
    encourage_send_waiting_buffer()
        send RTS with mop_id
        SQ_WAITING_RTS_SEND_COMPLETION
    ib_check_cq()
        get mop_id back
        SQ_WAITING_CTS
    encourage_send_incoming_cts()
        get mop_id, look up send state structure (sq)
        send data
        SQ_WAITING_DATA_SEND_COMPLETION
    ib_check_cq()
        SQ_WAITING_RTS_DONE_BUFFER
    encourage_rts_done_waiting_buffer()
        send done message with mop_id
        SQ_WAITING_RTS_DONE_SEND_COMPLETION
    ib_check_cq()
        SQ_WAITING_USER_TEST
    test() et al
        free mop_id
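The sender phases above form a strictly linear state machine. The sketch below
is an illustration only: the SQ_* names follow the list, but the transition
helper and the SQ_FREE terminator are invented for this example and are not
the real BMI/IB code:

```c
/* Sender-side states, in the order the phases occur. */
enum sq_state {
    SQ_WAITING_BUFFER,                   /* post_send() assigned mop_id */
    SQ_WAITING_RTS_SEND_COMPLETION,      /* RTS posted with mop_id */
    SQ_WAITING_CTS,                      /* waiting for server's CTS */
    SQ_WAITING_DATA_SEND_COMPLETION,     /* data posted after CTS */
    SQ_WAITING_RTS_DONE_BUFFER,          /* data sent, need done buffer */
    SQ_WAITING_RTS_DONE_SEND_COMPLETION, /* done message posted */
    SQ_WAITING_USER_TEST,                /* waiting for test() to reap */
    SQ_FREE                              /* mop_id released (invented) */
};

/* Each completion event advances the sq exactly one phase. */
static enum sq_state sq_next_state(enum sq_state s)
{
    return s == SQ_FREE ? SQ_FREE : (enum sq_state) (s + 1);
}
```

The key point for the bug is the last transition: the mop_id is only freed
once the user's test() retires the sq.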
The receiver goes like this. Case preposted:
    post_recv()
        assign mop_id
        RQ_WAITING_INCOMING
    ib_check_cq()
        (his rts arrives)
        assign rq->rts_mop_id with sender's mop_id
        RQ_RTS_WAITING_CTS_BUFFER
        goto common path
Case non-preposted:
    ib_check_cq()
        (his rts arrives)
        alloc new recv state rq
        assign rq->rts_mop_id with sender's mop_id
        RQ_RTS_WAITING_USER_POST
    post_recv()
        assign our mop_id to existing rq
        goto common path
Common recv path:
    send_cts()
        send cts with sender's mop_id
        RQ_RTS_WAITING_CTS_SEND_COMPLETION
    ib_check_cq()
        (cts send completed)
        RQ_RTS_WAITING_RTS_DONE
    ib_check_cq()
        (his done message arrived)
        compare sender's mop_id against his stored mop_id
        RQ_RTS_WAITING_USER_TEST
    test() et al
        free mop_id
The bug can be seen from your trace. The client posts 4 recvs: 8k,
1200k, 8k, 8k. The sender sends something to match recv 1 (mopid 1),
then recv 3 (mopid 3), then recv 4 (mopid 4). But mopid 1 ==
mopid 4. This is valid, as the server could have done a "test" and
retired the mopid, only to use it again for send 4. But to the
client, this mopid is still on the "active recv" list twice.
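To make the reuse concrete, here is a toy model with all names invented, not
the real BMI code: a one-slot allocator stands in for malloc() handing the
same block back after a free, so the pointer-cast "fast" id repeats as soon
as the send state structure is recycled:

```c
#include <stdint.h>

/* Toy one-slot allocator: a deterministic stand-in for malloc()
 * returning a just-freed block for the next allocation. */
static char slot[64];

static void *toy_alloc(void)
{
    return slot;
}

static void toy_free(void *p)
{
    (void) p;  /* nothing to do; slot is simply reusable */
}

/* The "fast" id is nothing but the block's address. */
static uint64_t fast_id(void *p)
{
    return (uint64_t) (uintptr_t) p;
}
```

In the real run, the server's sq for send 1 is retired by test(), the
allocator recycles the block for send 4, and the client, which still has
mopid 1 active, sees the same id twice.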
Your trace, by the way, has lots of 8192-byte sends. These end up
going RTS rather than eager, as the limit is around 8170 bytes. You
might want to bump up DEFAULT_EAGER_BUF_SIZE in ib.h to get better
performance.
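For example, something like the following in ib.h; the macro's current
definition and a sensible new value are assumptions here, so check the header
(and remember both sides must agree on the buffer size) before changing it:

```c
/* ib.h: raise the eager threshold so 8 KB messages stay eager
 * instead of taking the RTS/CTS path. 16 KB is a guess; tune it
 * to your message-size distribution. */
#define DEFAULT_EAGER_BUF_SIZE (16 * 1024)
```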
Here's a patch (untested), with a bit more debugging. Let me know
if it works and I'll check it in. Thanks for tracking this down.
-- Pete
Index: src/io/bmi/bmi_ib/ib.c
===================================================================
RCS file: /projects/cvsroot/pvfs2/src/io/bmi/bmi_ib/ib.c,v
retrieving revision 1.62
diff -u -p -r1.62 ib.c
--- src/io/bmi/bmi_ib/ib.c 5 Dec 2007 20:03:17 -0000 1.62
+++ src/io/bmi/bmi_ib/ib.c 6 Feb 2008 18:40:09 -0000
@@ -217,8 +217,9 @@ static int ib_check_cq(void)
assert(0, "%s: unknown send state %s (%d) of sq %p",
__func__, sq_state_name(sq->state.send),
sq->state.send, sq);
- debug(2, "%s: send to %s completed locally: -> %s",
- __func__, bh->c->peername, sq_state_name(sq->state.send));
+ debug(2, "%s: send to %s completed locally: sq %p -> %s",
+ __func__, bh->c->peername, sq,
+ sq_state_name(sq->state.send));
} else {
struct ib_work *rq = sq; /* rename */
@@ -229,8 +230,9 @@ static int ib_check_cq(void)
else
assert(0, "%s: unknown send state %s of rq %p",
__func__, rq_state_name(rq->state.recv), rq);
- debug(2, "%s: send to %s completed locally: -> %s",
- __func__, bh->c->peername, rq_state_name(rq->state.recv));
+ debug(2, "%s: send to %s completed locally: rq %p -> %s",
+ __func__, bh->c->peername, rq,
+ rq_state_name(rq->state.recv));
}
qlist_add_tail(&bh->list, &bh->c->eager_send_buf_free);
@@ -599,7 +601,8 @@ encourage_recv_incoming(struct buf_head
rq = NULL;
qlist_for_each_entry(rqt, &ib_device->recvq, list) {
- if (rqt->c == c && rqt->rts_mop_id == mh_rts_done.mop_id) {
+ if (rqt->c == c && rqt->rts_mop_id == mh_rts_done.mop_id &&
+ rqt->state.recv == RQ_RTS_WAITING_RTS_DONE) {
rq = rqt;
break;
}
@@ -607,9 +610,6 @@ encourage_recv_incoming(struct buf_head
assert(rq, "%s: mop_id %llx in RTS_DONE message not found",
__func__, llu(mh_rts_done.mop_id));
- assert(rq->state.recv == RQ_RTS_WAITING_RTS_DONE,
- "%s: RTS_DONE to rq wrong state %s",
- __func__, rq_state_name(rq->state.recv));
#if MEMCACHE_BOUNCEBUF
memcpy_to_buflist(&rq->buflist, reg_recv_buflist_buf,
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers