Kyle, 

Clients: ~120 dual-core / dual-processor 2.6-3.0 GHz Opterons with
8-32 GB of memory.  Each has one SilverStorm 9000 DDR PCI-Express
single-port HCA (lspci says: InfiniBand: Mellanox Technologies
MT25204 [InfiniHost III Lx HCA] (rev 20)).  All of them mount the
pvfs filesystem via InfiniBand, so I guess the ethernet NIC isn't
important (just in case: Ethernet controller: Broadcom Corporation
NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 21)).
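
In case the HCA details matter, I can confirm the firmware and link
rate with the standard OFED tools (field names in the output may
vary by version):

    # HCA model, firmware version, and port state
    ibv_devinfo
    # link width/rate per port
    ibstat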

3 I/O servers (2 I/O + 1 metadata+I/O).  Each is a dual-core /
dual-processor 2.8 GHz Opteron with 8 GB of memory and the same
InfiniBand HCA as the clients.  Each server has 4 x 146 GB SAS
drives (SAS5) with hardware RAID 0.  The total file system is
~1.5 TB.
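
To rule out the local arrays as a bottleneck, I could also time a
large streaming write on each server (path and size here are just
placeholders):

    # write 4 GB to the storage space and flush it, timing the whole thing
    time ( dd if=/dev/zero of=/pvfs-storage/ddtest bs=1M count=4096 && sync )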

The InfiniBand switch is a SilverStorm/QLogic 9120 4X DDR.
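
From a node I can also scan the fabric for link-level errors using
the infiniband-diags tools that ship with OFED:

    # flag any fabric port whose error counters exceed the threshold
    ibcheckerrors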

Did I leave something out?  

Thanks again, 

Eric

On Sun, Mar 30, 2008 at 06:35:30PM -0500, Kyle Schochenmaier wrote:
> We are currently trying to track down this bug, as well as one other
> involving potential data corruption under heavy load.
> I would like to say that I haven't seen this bug since some patches
> were committed a while back.
> 
> Can you include some more detailed information about your hardware
> setup, specifically the types of NICs?
> We've found some bugs that occur on slower NICs but not on faster
> ones, so knowing what hardware you are running might help us out here.
> 
> Tomorrow I can sit down and look at this further; I'm also going to
> cc this to the pvfs2-dev list.
> 
> ~Kyle
> 
> 
> On Sun, Mar 30, 2008 at 1:05 PM, Eric J. Walter
> <[EMAIL PROTECTED]> wrote:
> > Dear pvfs2-users,
> >
> >  I have been trying to get pvfs2 working over InfiniBand for a few
> >  weeks now and have made a lot of progress.  I am still stuck on
> >  one last thing I can't seem to fix.
> >
> >  Basically, everything will be fine for a while (like a few days), then
> >  I see the following in one of the pvfs2-server.logs (when the
> >  debugging mask is set to "all"):
> >
> >  [E 03/30 11:50] Error: encourage_recv_incoming: mop_id 680cc0 in RTS_DONE message not found.
> >  [E 03/30 11:50]         [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server(error+0xbd) [0x45d9ed]
> >  [E 03/30 11:50]         [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x45b571]
> >  [E 03/30 11:50]         [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x45d281]
> >  [E 03/30 11:50]         [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server(BMI_testcontext+0x120) [0x43cd40]
> >  [E 03/30 11:50]         [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x43508d]
> >  [E 03/30 11:50]         [bt] /lib64/tls/libpthread.so.0 [0x354b90610a]
> >  [E 03/30 11:50]         [bt] /lib64/tls/libc.so.6(__clone+0x73) [0x354b0c68c3]
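> >
> >  For reference, the "all" mask above comes from the server config
> >  file; a minimal sketch of the relevant bit, assuming the usual
> >  Defaults section and that EventLogging is the right keyword:
> >
> >      <Defaults>
> >          EventLogging all
> >      </Defaults>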
> >
> >  At this point all mounts are hung; recovering requires restarting
> >  the servers, remounting on all clients, and restarting all jobs
> >  using this space.
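> >
> >  When it hangs like this, I could also attach gdb to the stuck
> >  pvfs2-server and dump what every thread is doing (plain gdb,
> >  nothing pvfs2-specific):
> >
> >      gdb -p $(pidof pvfs2-server)
> >      (gdb) thread apply all bt
> >      (gdb) detach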
> >
> >  Only one server at a time ever suffers this problem; we have 3
> >  servers total for I/O (one of them handles both metadata and I/O),
> >  and the message can occur on any of the 3.
> >
> >  It seems to occur only when the number of clients accessing the
> >  file system grows beyond, say, 15-20, or perhaps it is a general
> >  filesystem load issue?  I haven't been able to tell...
> >
> >  I am using the CVS version from 03/23/08 (I also tried version
> >  2.6.3, but that had other problems mentioned on the pvfs2-users
> >  mailing list, so I decided to go with the CVS version).
> >
> >  I am using OFED version 1.1 on a cluster of dual-core/processor
> >  Opterons running kernel 2.6.9-42.ELsmp.  We have 114 clients which
> >  mount the pvfs file space over InfiniBand and use it as scratch
> >  space.  They don't use MPI-IO/ROMIO; they just write directly to
> >  the pvfs2 file space mounted via IB (I guess they write through
> >  the kernel interface).  The errors seem to occur when more than
> >  15-20 processors' worth of jobs try to read/write to the pvfs
> >  scratch space, or they could just be random.
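> >
> >  In case anyone wants to try reproducing it, the access pattern is
> >  roughly N clients streaming large writes at once; a sketch of what
> >  I could run from ~20 nodes simultaneously (mount point and sizes
> >  are made up):
> >
> >      # run the same command on ~20 clients at the same time
> >      dd if=/dev/zero of=/mnt/pvfs2/scratch/$(hostname).dat bs=1M count=1024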
> >
> >  Does anyone have some clues for how to debug this further or track
> >  down what the problem is?
> >
> >  Any suggestions are welcome.
> >
> >  Thanks,
> >
> >  Eric J. Walter
> >  Department of Physics
> >  College of William and Mary
> >
> >
> 
> 
> 
> -- 
> Kyle Schochenmaier
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
