Dear pvfs2-users,

I have been trying to get pvfs2 working over infiniband for a few
weeks now and have made a lot of progress.  I am still stuck on one
last thing I can't seem to fix.

Basically, everything will be fine for a while (typically a few days),
then I see the following in one of the pvfs2-server log files (with
the debugging mask set to "all"):

[E 03/30 11:50] Error: encourage_recv_incoming: mop_id 680cc0 in RTS_DONE message not found.
[E 03/30 11:50]         [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server(error+0xbd) [0x45d9ed]
[E 03/30 11:50]         [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x45b571]
[E 03/30 11:50]         [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x45d281]
[E 03/30 11:50]         [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server(BMI_testcontext+0x120) [0x43cd40]
[E 03/30 11:50]         [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x43508d]
[E 03/30 11:50]         [bt] /lib64/tls/libpthread.so.0 [0x354b90610a]
[E 03/30 11:50]         [bt] /lib64/tls/libc.so.6(__clone+0x73) [0x354b0c68c3]

At this point all mounts hang. Recovering requires a restart/remount
of all servers and clients, and all jobs using this space need to be
restarted.

Only one server at a time seems to suffer this problem. We have 3 I/O
servers in total (one of which also serves metadata), and this message
can occur on any of the 3.

It seems that this occurs only when the number of clients accessing
the file system gets larger than about 15-20, or perhaps it is a
filesystem load issue? I haven't been able to tell.

I am using the CVS version from 03/23/08 (I have also tried version
2.6.3 but this had other problems mentioned in the pvfs2 users mailing
list, so I decided to go to the CVS version).

I am using OFED version 1.1 on a cluster of dual core/processor
Opterons running kernel 2.6.9-42.ELsmp.  We have 114 clients which
mount the pvfs2 file space over InfiniBand and use it as scratch
space.  The jobs don't use MPI-IO/ROMIO; they just write directly to
the pvfs2 file space mounted via IB (I guess they write through the
kernel interface).  The errors seem to occur when more than 15-20
processors' worth of jobs try to read/write to the pvfs2 scratch
space, or they could just be random.

Does anyone have some clues for how to debug this further or track
down what the problem is?
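
One possible first step, sketched here only as an illustration: pull
the timestamps and mop_id values of the RTS_DONE error out of each
server's log so they can be compared against job start times or client
counts. The regular expression is based solely on the log excerpt
above; the function names are hypothetical and not part of pvfs2.

```python
import re
from collections import Counter

# Pattern taken from the log excerpt in this message, e.g.
# "[E 03/30 11:50] Error: encourage_recv_incoming: mop_id 680cc0 in RTS_DONE"
# It stops at RTS_DONE so it matches whether or not the line is wrapped.
ERROR_RE = re.compile(
    r"\[E (\d{2}/\d{2}) (\d{2}:\d{2})\] Error: encourage_recv_incoming: "
    r"mop_id ([0-9a-fA-F]+) in RTS_DONE"
)


def find_rts_done_errors(lines):
    """Return a list of (date, time, mop_id) tuples, one per error line."""
    hits = []
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            hits.append(match.groups())
    return hits


def errors_per_day(hits):
    """Count how many matching errors were logged on each date."""
    return Counter(date for date, _time, _mop_id in hits)
```

Usage would be something like opening each server's log file, passing
the file object to find_rts_done_errors(), and comparing the resulting
timestamps across the three servers.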

Any suggestions are welcome.

Thanks, 

Eric J. Walter
Department of Physics
College of William and Mary


_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
