Kyle,

Clients: ~120 dual-core / dual-processor 2.6-3.0 GHz Opterons with 8-32 GB of memory, each with one SilverStorm 9000 DDR PCI-Express single-port HCA (lspci says: InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20)). All mount the pvfs filesystem via InfiniBand, so I guess the ethernet NIC isn't important (just in case: Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 21)).
3 I/O servers (2 I/O + 1 metadata+I/O). Each is a dual-core / dual-processor 2.8 GHz Opteron with 8 GB of memory and the same InfiniBand HCA as the clients. Each server has 4 x 146 GB SAS5 drives with hardware RAID 0. The total file system is ~1.5 TB. The InfiniBand switch is a SilverStorm/QLogic 9120 4x DDR.

Did I leave something out?

Thanks again,

Eric

On Sun, Mar 30, 2008 at 06:35:30PM -0500, Kyle Schochenmaier wrote:
> We are currently trying to track down this bug, as well as one other
> involving potential data corruption under heavy load. I would like to
> say that I haven't seen this bug after some patches that were
> committed a while back.
>
> Can you include some more detailed information about your hardware
> setup, the types of NICs specifically? We've found some bugs that
> occur on slower NICs but not on faster NICs, so knowing what hardware
> you are running might help us out here.
>
> Tomorrow I can sit down and look at this further; also I'm going to cc
> this to the pvfs2-dev list.
>
> ~Kyle
>
>
> On Sun, Mar 30, 2008 at 1:05 PM, Eric J. Walter
> <[EMAIL PROTECTED]> wrote:
> > Dear pvfs2-users,
> >
> > I have been trying to get pvfs2 working over InfiniBand for a few
> > weeks now and have made a lot of progress. I am still stuck on one
> > last thing I can't seem to fix.
> >
> > Basically, everything will be fine for a while (like a few days),
> > then I see the following in one of the pvfs2-server logs (when the
> > debugging mask is set to "all"):
> >
> > [E 03/30 11:50] Error: encourage_recv_incoming: mop_id 680cc0 in RTS_DONE message not found.
> > [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server(error+0xbd) [0x45d9ed]
> > [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x45b571]
> > [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x45d281]
> > [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server(BMI_testcontext+0x120) [0x43cd40]
> > [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x43508d]
> > [E 03/30 11:50] [bt] /lib64/tls/libpthread.so.0 [0x354b90610a]
> > [E 03/30 11:50] [bt] /lib64/tls/libc.so.6(__clone+0x73) [0x354b0c68c3]
> >
> > At this point all mounts will be hung and will require a
> > restart/remount of all servers and clients, and all jobs using this
> > space will need to be restarted.
> >
> > Only one server ever seems to suffer this problem at a time, i.e. we
> > have 3 servers total for I/O (one for both metadata and I/O) and this
> > message can occur on any of the 3 servers.
> >
> > It seems that this occurs only when the number of clients accessing
> > it gets larger than, say, 15-20, or perhaps it is a filesystem load
> > issue? I haven't been able to tell...
> >
> > I am using the CVS version from 03/23/08 (I have also tried version
> > 2.6.3, but it had other problems mentioned on the pvfs2-users mailing
> > list, so I decided to go to the CVS version).
> >
> > I am using OFED version 1.1 on a cluster of dual-core/dual-processor
> > Opterons running kernel 2.6.9-42.ELsmp. We have 114 clients which
> > mount the pvfs file space over InfiniBand and use it as scratch
> > space. They don't use MPI-IO/ROMIO; they just write directly to the
> > pvfs2 file space mounted via IB (I guess they write through the
> > kernel interface). The errors seem to occur when more than 15-20
> > processors' worth of jobs try to read/write to the pvfs scratch
> > space, or they could be just random.
> >
> > Does anyone have some clues for how to debug this further or track
> > down what the problem is?
> >
> > Any suggestions are welcome.
> >
> > Thanks,
> >
> > Eric J. Walter
> > Department of Physics
> > College of William and Mary
> >
> >
> > _______________________________________________
> > Pvfs2-users mailing list
> > [email protected]
> > http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
> >
>
> --
> Kyle Schochenmaier

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
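
As a starting point for narrowing down the hang, below is a minimal log-scanning sketch in Python. The error-line format is taken verbatim from the backtrace quoted above; everything else (the script name, how the logs are gathered) is only an assumption for illustration, so point it at whatever pvfs2-server log files your servers actually write.

    #!/usr/bin/env python
    # Minimal sketch: scan pvfs2-server logs for the "RTS_DONE ... not found"
    # error reported in this thread and print when each one happened.
    import re
    import sys

    # Pattern built from the log line quoted above, e.g.:
    # [E 03/30 11:50] Error: encourage_recv_incoming: mop_id 680cc0 in RTS_DONE message not found.
    ERROR_RE = re.compile(
        r"\[E (?P<stamp>[\d/]+ [\d:]+)\] "
        r"Error: encourage_recv_incoming: "
        r"mop_id (?P<mop_id>[0-9a-fA-F]+) in RTS_DONE"
    )

    def scan(path):
        """Yield (timestamp, mop_id) for each RTS_DONE error in one log file."""
        with open(path) as log:
            for line in log:
                match = ERROR_RE.search(line)
                if match:
                    yield match.group("stamp"), match.group("mop_id")

    def main(paths):
        if not paths:
            sys.stderr.write("usage: scan_rts_done.py pvfs2-server.log [...]\n")
            return 1
        for path in paths:
            for stamp, mop_id in scan(path):
                print("%s: RTS_DONE error at %s (mop_id %s)" % (path, stamp, mop_id))
        return 0

    if __name__ == "__main__":
        sys.exit(main(sys.argv[1:]))

Running it over the logs from all three servers (e.g. scan_rts_done.py server1.log server2.log server3.log) should show whether the failures cluster on one server or line up with the times that 15-20 clients' worth of jobs start hitting the scratch space.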
