Dear pvfs2-users,

I have been trying to get PVFS2 working over InfiniBand for a few weeks now and have made a lot of progress. I am still stuck on one last thing I can't seem to fix.
Basically, everything will be fine for a while (a few days), then I see the following in one of the pvfs2-server logs (the debugging mask is set to "all"; see the P.S. for how that is configured):

    [E 03/30 11:50] Error: encourage_recv_incoming: mop_id 680cc0 in RTS_DONE message not found.
    [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server(error+0xbd) [0x45d9ed]
    [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x45b571]
    [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x45d281]
    [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server(BMI_testcontext+0x120) [0x43cd40]
    [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x43508d]
    [E 03/30 11:50] [bt] /lib64/tls/libpthread.so.0 [0x354b90610a]
    [E 03/30 11:50] [bt] /lib64/tls/libc.so.6(__clone+0x73) [0x354b0c68c3]

At this point all mounts are hung: every server and client has to be restarted/remounted, and all jobs using this space have to be restarted. Only one server ever suffers the problem at a time; we have 3 servers total for I/O (one of them handles both metadata and I/O), and the message can occur on any of the 3. It seems to happen only when the number of clients accessing the space grows beyond, say, 15-20, or perhaps it is a filesystem load issue? I haven't been able to tell.

I am using the CVS version from 03/23/08. (I also tried version 2.6.3, but it had other problems mentioned on the pvfs2-users mailing list, so I moved to the CVS version.) I am using OFED version 1.1 on a cluster of dual-core/dual-processor Opterons running kernel 2.6.9-42.ELsmp. We have 114 clients which mount the PVFS file space over InfiniBand and use it as scratch space. They don't use MPI-IO/ROMIO; they just write directly to the pvfs2 file space mounted via IB (I guess they write through the kernel interface; see the second P.S. for the mount details).

The errors seem to occur when more than 15-20 processors' worth of jobs try to read/write to the pvfs scratch space, or they could be just random. Does anyone have some clues for how to debug this further or track down what the problem is? Any suggestions are welcome.

Thanks,

Eric J. Walter
Department of Physics
College of William and Mary
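
P.S. For reference, the debugging mask mentioned above is set server-side with the EventLogging keyword in the <Defaults> section of our fs.conf, roughly as below (this is a minimal sketch; the LogFile path shown is a placeholder, not our actual one, and the other Defaults entries are elided). The servers were restarted after the change:

    <Defaults>
        # enable all debugging output (default is "none")
        EventLogging all
        # illustrative path; ours differs
        LogFile /var/log/pvfs2-server.log
    </Defaults>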
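
P.P.S. In case the mount details matter: each client has the pvfs2 kernel module loaded and the pvfs2-client daemon running, and then mounts the file space over IB with something along these lines (hostname, port number, and mount point below are illustrative placeholders rather than our exact values):

    # pvfs2-client must already be running on the node
    /sbin/mount -t pvfs2 ib://io-server-1:3335/pvfs2-fs /pvfs-scratch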
