[EMAIL PROTECTED] wrote on Tue, 29 Jan 2008 16:09 -0600:
> I've been running GAMESS tests with about 160 GB on the filesystem trying
> to stress the network a bit and have managed to reproducibly get the
> pvfs2-client to end on an assertion failure in
> "src/io/bmi/bmi_ib/ib.c:611"
> 
> I haven't been able to figure out exactly what is occurring that is causing
> this assertion failure, but from the code it really appears as if this
> shouldn't ever be occurring, obviously (assertion) ;)  Maybe we're getting
> duplicate messages or double-testing a message somehow.
> 
> I'm running cvs HEAD, debian 2.6.18, and using bmi_ib modules over the vfs.
> 
> [E 15:54:20.197159] Error: encourage_recv_incoming: RTS_DONE to rq wrong state RQ_RTS_WAITING_USER_TEST.
> [E 15:54:20.200927]     [bt] pvfs2-client-core(error+0xca) [0x41a2ba]
> [E 15:54:20.200940]     [bt] pvfs2-client-core [0x41779f]
> [E 15:54:20.200948]     [bt] pvfs2-client-core [0x417e3a]
> [E 15:54:20.200955]     [bt] pvfs2-client-core [0x4181fd]
> [E 15:54:20.200963]     [bt] pvfs2-client-core(job_bmi_recv+0xea) [0x422f0a]
> [E 15:54:20.200971]     [bt] pvfs2-client-core [0x441a18]
> [E 15:54:20.200978]     [bt] pvfs2-client-core(PINT_state_machine_invoke+0xd2) [0x431be2]
> [E 15:54:20.200986]     [bt] pvfs2-client-core(PINT_state_machine_next+0xcc) [0x43198c]
> [E 15:54:20.200994]     [bt] pvfs2-client-core(PINT_client_state_machine_post+0x99) [0x4383e9]
> [E 15:54:20.201001]     [bt] pvfs2-client-core(PVFS_isys_io+0x324) [0x4430a4]
> [E 15:54:20.201009]     [bt] pvfs2-client-core [0x4117a6]
> [E 15:54:20.205453] pvfs2-client-core with pid 6251 exited with value 1

That is indeed scary.  The server has sent MSG_RTS_DONE to the
client.  The client looks up the mop_id (64-bit number in header)
and finds it corresponds to a message that it thought had already
been completed.  The message is in "waiting user test" which means
IB is all done, it just is waiting for the upper layers to ask for
the completion status.
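
To make the failing check concrete, here is a rough sketch of the
kind of test that fires.  It is not the code from ib.c: only
RQ_RTS_WAITING_USER_TEST comes from the error message above, while
RQ_RTS_WAITING_RTS_DONE, struct recv_op and handle_rts_done() are
names assumed for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Only RQ_RTS_WAITING_USER_TEST appears in the error above; the other
 * state name and the struct layout are assumptions for this sketch. */
enum rq_state {
    RQ_RTS_WAITING_RTS_DONE,   /* assumed: transfer still in flight */
    RQ_RTS_WAITING_USER_TEST   /* IB is done, upper layer has not tested yet */
};

struct recv_op {
    uint64_t mop_id;           /* 64-bit id carried in the message header */
    enum rq_state state;
};

/* Returns 0 if the RTS_DONE was expected, -1 if the op is in the wrong state. */
static int handle_rts_done(struct recv_op *rq)
{
    if (rq->state != RQ_RTS_WAITING_RTS_DONE) {
        fprintf(stderr, "RTS_DONE for mop_id %llx in wrong state %d\n",
                (unsigned long long) rq->mop_id, (int) rq->state);
        return -1;
    }
    rq->state = RQ_RTS_WAITING_USER_TEST;
    return 0;
}
```

In this sketch a second RTS_DONE for the same op takes the error
path, which is the situation the backtrace suggests.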

You could turn on debugging, level 2, which I think is the default.
Enable it on the client core by starting it up with

    pvfs2-client --gossip-mask=network

then look at the /tmp/pvfs-client.log (or whatever it is called, I forget) and try
to find some patterns.  You will see these messages:

        debug(2, "%s: recv RTS_DONE mop_id %llx", __func__,

whenever the client gets a MSG_RTS_DONE.  If you see a duplicate
mop_id (or not) before your assert, that will help us narrow the
problem.
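
If the log gets big, something like the following might save some
eyeballing.  It is only a sketch: it assumes each log line contains
the text "recv RTS_DONE mop_id " followed by the id in hex, as the
format string above suggests, and the helper names parse_mop_id()
and count_duplicate_mop_ids() are made up here.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Pull the hex mop_id out of one log line; returns 1 on success, 0 otherwise. */
static int parse_mop_id(const char *line, uint64_t *id)
{
    static const char key[] = "recv RTS_DONE mop_id ";
    const char *p = strstr(line, key);
    char *end;

    if (!p)
        return 0;
    p += sizeof(key) - 1;
    *id = strtoull(p, &end, 16);
    return end != p;            /* no hex digits means no match */
}

/* Read a log from fp, print each repeated mop_id, and return the number
 * of repeats (or -1 if memory runs out). */
static int count_duplicate_mop_ids(FILE *fp)
{
    char line[1024];
    uint64_t *seen = NULL, id;
    size_t n = 0, cap = 0, i;
    int dups = 0;

    while (fgets(line, sizeof(line), fp)) {
        if (!parse_mop_id(line, &id))
            continue;
        /* Linear search: simple, and fine for a one-off log scan. */
        for (i = 0; i < n && seen[i] != id; i++)
            ;
        if (i < n) {
            printf("duplicate RTS_DONE for mop_id %llx\n",
                   (unsigned long long) id);
            dups++;
            continue;
        }
        if (n == cap) {
            uint64_t *tmp;
            cap = cap ? 2 * cap : 64;
            tmp = realloc(seen, cap * sizeof(*seen));
            if (!tmp) {
                free(seen);
                return -1;
            }
            seen = tmp;
        }
        seen[n++] = id;
    }
    free(seen);
    return dups;
}
```

Calling count_duplicate_mop_ids(stdin) from a small main() and
feeding it the client log should be enough.  If ids can be reused
over a long run, only repeats close to the assert are interesting.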

You can also turn on debugging on the server, with "pvfs2-set-debugmask
-m /pvfs network", and watch it report that it has sent RTS_DONE for a
given mop_id.  I don't think this will add any information yet, but FYI.

Probably easier to deal with all this on a single-server setup, if
possible.

                -- Pete
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers