> [EMAIL PROTECTED] wrote on Tue, 29 Jan 2008 16:09 -0600:
>> I've been running GAMESS tests with about 160 GB on the filesystem,
>> trying to stress the network a bit, and have managed to reproducibly
>> get the pvfs2-client to end on an assertion failure in
>> "src/io/bmi/bmi_ib/ib.c:611".
>>
>> I haven't been able to figure out exactly what is occurring that is
>> causing this assertion failure, but from the code it really appears as
>> if this shouldn't ever be occurring, obviously (assertion) ;)  Maybe
>> we're getting duplicate messages or double-testing a message somehow.
>>
>> I'm running CVS HEAD, Debian (2.6.18 kernel), and using bmi_ib modules
>> over the vfs.
>>
>> [E 15:54:20.197159] Error: encourage_recv_incoming: RTS_DONE to rq wrong state RQ_RTS_WAITING_USER_TEST.
>> [E 15:54:20.200927]     [bt] pvfs2-client-core(error+0xca) [0x41a2ba]
>> [E 15:54:20.200940]     [bt] pvfs2-client-core [0x41779f]
>> [E 15:54:20.200948]     [bt] pvfs2-client-core [0x417e3a]
>> [E 15:54:20.200955]     [bt] pvfs2-client-core [0x4181fd]
>> [E 15:54:20.200963]     [bt] pvfs2-client-core(job_bmi_recv+0xea) [0x422f0a]
>> [E 15:54:20.200971]     [bt] pvfs2-client-core [0x441a18]
>> [E 15:54:20.200978]     [bt] pvfs2-client-core(PINT_state_machine_invoke+0xd2) [0x431be2]
>> [E 15:54:20.200986]     [bt] pvfs2-client-core(PINT_state_machine_next+0xcc) [0x43198c]
>> [E 15:54:20.200994]     [bt] pvfs2-client-core(PINT_client_state_machine_post+0x99) [0x4383e9]
>> [E 15:54:20.201001]     [bt] pvfs2-client-core(PVFS_isys_io+0x324) [0x4430a4]
>> [E 15:54:20.201009]     [bt] pvfs2-client-core [0x4117a6]
>> [E 15:54:20.205453] pvfs2-client-core with pid 6251 exited with value 1
>
> That is indeed scary.  The server has sent MSG_RTS_DONE to the
> client.  The client looks up the mop_id (64-bit number in header)
> and finds it corresponds to a message that it thought had already
> been completed.  The message is in "waiting user test" which means
> IB is all done, it just is waiting for the upper layers to ask for
> the completion status.
>
> You could turn on debugging, level 2, which I think is the default.
> Enable it on the client core by starting it up with
>
>     pvfs2-client --gossip-mask=network
>
> then look at the /tmp/pvfs-client.log (or whatever I forget) and try
> to find some patterns.  You will see these messages:
>
>         debug(2, "%s: recv RTS_DONE mop_id %llx", __func__,
>
> whenever the client gets a MSG_RTS_DONE.  If you see a duplicate
> mop_id (or not) before your assert, that will help us narrow the
> problem.
>
> You can also turn debugging on the server, with "pvfs2-set-debugmask
> -m /pvfs network", and watch him say he has sent RTS_DONE of certain
> mopid.  I don't think this will add any information yet, but fyi.
>
> Probably easier to deal with all this on a single-server setup, if
> possible.
>
>               -- Pete
>
Thanks for the quick response!

I knew this was going to be tricky to debug; this failure usually doesn't
occur until about 100 GB into a read for us.  I have an identical
failure using a single node.  So far I've eliminated all but our Opteron
systems from the tests, so we're on relatively 'stable' systems with
respect to IB.

I'll look at this tomorrow and see if I can get the logs to be of any help.

From what you are saying, this isn't likely a duplicate message from IB,
but some duplicate mopid or a corrupt mopid?
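
For hunting repeats in the client log, something like the following could
help.  This is only a rough sketch: it assumes each relevant log line
contains the text produced by the quoted format string ("recv RTS_DONE
mop_id" followed by a hex value), and the helper name
find_repeated_mop_ids is made up for illustration, not part of PVFS.

```python
import re
from collections import defaultdict

# Assumed line content, taken from the quoted debug() format string:
#   "%s: recv RTS_DONE mop_id %llx"
RTS_DONE_RE = re.compile(r"recv RTS_DONE mop_id ([0-9a-fA-F]+)")


def find_repeated_mop_ids(lines):
    """Return {mop_id: [line numbers]} for each mop_id logged more than once.

    `lines` is any iterable of log lines (an open file works).
    Line numbers are 1-based.
    """
    seen = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        match = RTS_DONE_RE.search(line)
        if match:
            # %llx prints hex without a 0x prefix, so parse as base 16.
            seen[int(match.group(1), 16)].append(lineno)
    return {mop_id: nums for mop_id, nums in seen.items() if len(nums) > 1}
```

Passing the open log file to find_repeated_mop_ids() gives a dict of
suspect ids and the line numbers where they appeared.  A repeat is only a
hint: over a long run an id might be legitimately reused, so the
interesting case is a repeat shortly before the assertion.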

~Kyle




_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers