> [EMAIL PROTECTED] wrote on Tue, 29 Jan 2008 16:09 -0600:
>> I've been running GAMESS tests with about 160 GB on the filesystem,
>> trying to stress the network a bit, and have managed to reproducibly
>> get the pvfs2-client to end on an assertion failure in
>> "src/io/bmi/bmi_ib/ib.c:611".
>>
>> I haven't been able to figure out exactly what is occurring that
>> causes this assertion failure, but from the code it really appears as
>> if this shouldn't ever occur, obviously (assertion) ;) Maybe we're
>> getting duplicate messages or double-testing a message somehow.
>>
>> I'm running CVS HEAD, Debian 2.6.18, and using the bmi_ib modules over
>> the VFS.
>>
>> [E 15:54:20.197159] Error: encourage_recv_incoming: RTS_DONE to rq wrong state RQ_RTS_WAITING_USER_TEST.
>> [E 15:54:20.200927] [bt] pvfs2-client-core(error+0xca) [0x41a2ba]
>> [E 15:54:20.200940] [bt] pvfs2-client-core [0x41779f]
>> [E 15:54:20.200948] [bt] pvfs2-client-core [0x417e3a]
>> [E 15:54:20.200955] [bt] pvfs2-client-core [0x4181fd]
>> [E 15:54:20.200963] [bt] pvfs2-client-core(job_bmi_recv+0xea) [0x422f0a]
>> [E 15:54:20.200971] [bt] pvfs2-client-core [0x441a18]
>> [E 15:54:20.200978] [bt] pvfs2-client-core(PINT_state_machine_invoke+0xd2) [0x431be2]
>> [E 15:54:20.200986] [bt] pvfs2-client-core(PINT_state_machine_next+0xcc) [0x43198c]
>> [E 15:54:20.200994] [bt] pvfs2-client-core(PINT_client_state_machine_post+0x99) [0x4383e9]
>> [E 15:54:20.201001] [bt] pvfs2-client-core(PVFS_isys_io+0x324) [0x4430a4]
>> [E 15:54:20.201009] [bt] pvfs2-client-core [0x4117a6]
>> [E 15:54:20.205453] pvfs2-client-core with pid 6251 exited with value 1
>
> That is indeed scary. The server has sent MSG_RTS_DONE to the
> client. The client looks up the mop_id (a 64-bit number in the header)
> and finds that it corresponds to a message it thought had already
> been completed.
> The message is in "waiting user test", which means
> IB is all done; it is just waiting for the upper layers to ask for
> the completion status.
>
> You could turn on debugging at level 2, which I think is the default.
> Enable it on the client core by starting it up with
>
>     pvfs2-client --gossip-mask=network
>
> then look at /tmp/pvfs-client.log (or wherever, I forget) and try
> to find some patterns. You will see these messages:
>
>     debug(2, "%s: recv RTS_DONE mop_id %llx", __func__,
>
> whenever the client gets a MSG_RTS_DONE. If you see a duplicate
> mop_id (or not) before your assert, that will help us narrow down the
> problem.
>
> You can also turn on debugging on the server, with "pvfs2-set-debugmask
> -m /pvfs network", and watch it report that it has sent RTS_DONE for a
> certain mop_id. I don't think this will add any information yet, but FYI.
>
> Probably easier to deal with all this on a single-server setup, if
> possible.
>
> -- Pete

Thanks for the quick response!
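For a log of that size, eyeballing for repeated mop_ids gets tedious. Here is a minimal sketch of automating the check Pete suggests, assuming the debug lines contain the "recv RTS_DONE mop_id" text from the debug(2, ...) statement he quoted; the exact log path and line layout are assumptions, so adjust the regex and file name to what your client actually emits.

```python
import re
from collections import Counter

# Matches the hex mop_id in lines produced by the quoted debug(2, ...)
# statement.  The surrounding line format is an assumption.
RTS_DONE_RE = re.compile(r"recv RTS_DONE mop_id ([0-9a-fA-F]+)")

def find_duplicate_mop_ids(lines):
    """Return mop_ids seen in more than one RTS_DONE debug line."""
    counts = Counter(
        m.group(1)
        for line in lines
        if (m := RTS_DONE_RE.search(line))
    )
    return sorted(mop_id for mop_id, n in counts.items() if n > 1)

# Made-up sample lines for illustration; a real run would iterate over
# open("/tmp/pvfs-client.log") (path is an assumption) instead.
sample = [
    "[D 15:54:19.000001] encourage_recv_incoming: recv RTS_DONE mop_id 1a2b3c",
    "[D 15:54:19.500000] encourage_recv_incoming: recv RTS_DONE mop_id 4d5e6f",
    "[D 15:54:20.100000] encourage_recv_incoming: recv RTS_DONE mop_id 1a2b3c",
]
print(find_duplicate_mop_ids(sample))  # ['1a2b3c']
```

A duplicate mop_id shortly before the assert would point at a repeated MSG_RTS_DONE from the server; no duplicates would suggest a corrupt or reused id on the client side.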
I knew this was going to be tricky to debug; this failure usually doesn't occur until about 100 GB into a read for us. I have an identical failure using a single node. So far I've eliminated all but our Opteron systems from the tests, so we're on a relatively 'stable' system with respect to IB. I'll look at this tomorrow and see if I can get the logs to be of any help.

From what you are saying, this isn't likely a duplicate message from IB, but some duplicate mop_id or a corrupt mop_id?

~Kyle

_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
