> [EMAIL PROTECTED] wrote on Tue, 29 Jan 2008 16:09 -0600:
>> I've been running GAMESS tests with about 160GB on the filesystem,
>> trying to stress the network a bit, and have managed to reproducibly
>> get the pvfs2-client to end on an assertion failure in
>> "src/io/bmi/bmi_ib/ib.c:611".
>>
>> I haven't been able to figure out exactly what is occurring that is
>> causing this assertion failure, but from the code it really appears
>> as if this shouldn't ever be occurring, obviously (assertion) ;)
>> Maybe we're getting duplicate messages or double-testing a message
>> somehow.
>>
>> I'm running cvs HEAD, debian 2.6.18, and using bmi_ib modules over the
>> vfs.
>>
>> [E 15:54:20.197159] Error: encourage_recv_incoming: RTS_DONE to rq wrong state RQ_RTS_WAITING_USER_TEST.
>> [E 15:54:20.200927]     [bt] pvfs2-client-core(error+0xca) [0x41a2ba]
>> [E 15:54:20.200940]     [bt] pvfs2-client-core [0x41779f]
>> [E 15:54:20.200948]     [bt] pvfs2-client-core [0x417e3a]
>> [E 15:54:20.200955]     [bt] pvfs2-client-core [0x4181fd]
>> [E 15:54:20.200963]     [bt] pvfs2-client-core(job_bmi_recv+0xea) [0x422f0a]
>> [E 15:54:20.200971]     [bt] pvfs2-client-core [0x441a18]
>> [E 15:54:20.200978]     [bt] pvfs2-client-core(PINT_state_machine_invoke+0xd2) [0x431be2]
>> [E 15:54:20.200986]     [bt] pvfs2-client-core(PINT_state_machine_next+0xcc) [0x43198c]
>> [E 15:54:20.200994]     [bt] pvfs2-client-core(PINT_client_state_machine_post+0x99) [0x4383e9]
>> [E 15:54:20.201001]     [bt] pvfs2-client-core(PVFS_isys_io+0x324) [0x4430a4]
>> [E 15:54:20.201009]     [bt] pvfs2-client-core [0x4117a6]
>> [E 15:54:20.205453] pvfs2-client-core with pid 6251 exited with value 1
>
> That is indeed scary.  The server has sent MSG_RTS_DONE to the
> client.  The client looks up the mop_id (64-bit number in header)
> and finds it corresponds to a message that it thought had already
> been completed.  The message is in "waiting user test" which means
> IB is all done, it just is waiting for the upper layers to ask for
> the completion status.
>
> You could turn on debugging, level 2, which I think is the default.
> Enable it on the client core by starting it up with
>
>     pvfs2-client --gossip-mask=network
>
> then look at /tmp/pvfs2-client.log (or whatever it is, I forget) and
> try to find some patterns.  You will see these messages:
>
>         debug(2, "%s: recv RTS_DONE mop_id %llx", __func__,
>
> whenever the client gets a MSG_RTS_DONE.  If you see a duplicate
> mop_id (or not) before your assert, that will help us narrow the
> problem.
>
> You can also turn on debugging on the server, with
> "pvfs2-set-debugmask -m /pvfs network", and watch it report that it
> has sent RTS_DONE for a given mop_id.  I don't think this will add
> any information yet, but FYI.
>
> Probably easier to deal with all this on a single-server setup, if
> possible.
>
>               -- Pete
>

Pete -

I've attached a link to a log of the failure with network debugging on in
the client, single IO node.  The whole log is 5.9GB so I only attached the
last 10k lines.  Same error as before of course.

http://www.scl.ameslab.gov/~kschoche/pvfs2-client.log.gz

The mop_ids are fairly difficult to track, as they are used all over the
place, and I can't make out anything useful from the log :'(
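
One way to cut the log down might be to pull out only the RTS_DONE
lines and flag any mop_id that shows up more than once.  Below is a
minimal sketch, assuming the log lines carry the text from the debug()
format string quoted above ("recv RTS_DONE mop_id <hex>").  The function
name and the sample lines are made up for illustration and are not from
the real log.

```python
import re
from collections import defaultdict

# Assumed line shape, based on the debug() format string quoted above:
#   "<func>: recv RTS_DONE mop_id <hex>"
RTS_DONE_RE = re.compile(r"recv RTS_DONE mop_id ([0-9a-fA-F]+)")


def find_duplicate_rts_done(lines):
    """Return {mop_id: [line numbers]} for mop_ids seen more than once."""
    seen = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        match = RTS_DONE_RE.search(line)
        if match:
            seen[match.group(1).lower()].append(lineno)
    return {mop_id: nums for mop_id, nums in seen.items() if len(nums) > 1}


# Hypothetical sample lines, not taken from the real log.
sample = [
    "[D 15:54:20.100000] encourage_recv_incoming: recv RTS_DONE mop_id 1a2b",
    "[D 15:54:20.100100] some unrelated network message",
    "[D 15:54:20.100200] encourage_recv_incoming: recv RTS_DONE mop_id 3c4d",
    "[D 15:54:20.100300] encourage_recv_incoming: recv RTS_DONE mop_id 1a2b",
]
duplicates = find_duplicate_rts_done(sample)
print(duplicates)
```

Passing an open file object instead of the sample list reads the log
one line at a time, so the full 5.9GB file would not need to fit in
memory.  A repeated mop_id is only a starting point, since the ids may
well be reused over a long run; the line numbers show where to look
around each repeat.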

Any advice would be great,

~Kyle

_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
