These are all memfree cards; hrm, that rules out one of my ideas, namely
that this was related to NICs with on-board memory, which tend to have
lower resource capacities.

Looking back at your log, I see that you're getting a lost mop_id, which
means we lost a message somewhere along the line. That is generally due
to a server going out to lunch, or to network problems.

This error is the result of a failed assert in the bmi-ib layer. We've
recently made some modifications to the lines that precede it, and I'm
wondering if those are somehow incorrect or need some other checks.
Pete is our resident expert; I'll see if he has some insight.

We will probably need to know the state of the system when this
assertion fails, so you'll need to run gdb against your server
processes and try to break here (in ib.c):
        bmi_ib_assert(rq, "%s: mop_id %llx in RTS_DONE message not found",
                      __func__, llu(mh_rts_done.mop_id));

You may find it easier to put a line just above it, such as:

if (!rq)
    printf("lost mop_id, about to assert\n");

and set a breakpoint on the printf, since I can't remember how well you
can set breakpoints meaningfully around assertions.
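
Roughly, the gdb session could look like this; treat it as a sketch,
since the server PID and the breakpoint line number are placeholders,
and it assumes your pvfs2-server binary was built with symbols (-g):

    # attach to the already-running server (binary path taken from your log)
    gdb /share/apps/pvfs2_032308CVS/sbin/pvfs2-server <server pid>

    # break on the printf added above; <line> is its line number in ib.c
    (gdb) break ib.c:<line>
    (gdb) continue

    # when it trips, grab as much state as you can:
    (gdb) print mh_rts_done.mop_id
    (gdb) bt
    (gdb) thread apply all bt

The last command is because the server is threaded; the stacks of the
other threads may tell us who dropped the message.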

I'm not sure debug masks will get us to a solution here, but Pete may
say differently.
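
If we do end up wanting one, I'd narrow EventLogging from "all" so the
BMI traffic isn't buried in everything else. Something like the below,
though I'm going from memory on the mask names and the admin-tool
flags, so double-check them against your build:

    # in the <Defaults> section of the server config file:
    EventLogging network,server

    # or, to change it on already-running servers:
    pvfs2-set-debugmask -m /path/to/pvfs2/mount "network,server"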


Pete, I think this is the same error, or at least one of the errors,
that I've stumbled across and not been able to figure out yet. Do you
have any ideas/comments?

~Kyle


On Mon, Mar 31, 2008 at 7:47 PM, Eric J. Walter
<[EMAIL PROTECTED]> wrote:
>
>  Kyle,
>
>  Clients: ~120 dual core / dual proc 2.6-3.0 GHz Opterons w/ 8-32GB of
>  memory each with one SilverStorm 9000 DDR PCI-Express single port HCA
>  (lspci says: InfiniBand: Mellanox Technologies MT25204 [InfiniHost III
>  Lx HCA] (rev 20)). All mount the pvfs filesystem via Infiniband so I
>  guess the ethernet NIC isn't important (just in case: Ethernet
>  controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet
>  PCI Express (rev 21)).
>
>  3 I/O servers (2 I/O + 1 Metadata+I/O).  Each is a dual core / dual
>  proc 2.8 GHz Opteron with 8 GB memory and the same Infiniband HCA as
>  the clients. Each server has 4X146 GB SAS5 with hardware RAID 0.  The
>  total file system is ~1.5 TB.
>
>  The Infiniband switch is a SilverStorm/Qlogic 9120 4x DDR.
>
>  Did I leave something out?
>
>  Thanks again,
>
>  Eric
>
>  On Sun, Mar 30, 2008 at 06:35:30PM -0500, Kyle Schochenmaier wrote:
>  > We are currently trying to track down this bug, as well as one other
>  > involving potential data corruption under heavy load.
>  > I would like to say that I haven't seen this bug since some patches
>  > that were committed a while back.
>  >
>  > Can you include some more detailed information about your hardware
>  > setup, the types of NICs specifically?  We've found some bugs that
>  > occur on slower NICs but not on faster ones, so knowing what hardware
>  > you are running might help us out here.
>  >
>  > Tomorrow I can sit down and look at this further; I'm also going to
>  > cc this to the pvfs2-dev list.
>  >
>  > ~Kyle
>  >
>  >
>  > On Sun, Mar 30, 2008 at 1:05 PM, Eric J. Walter
>  > <[EMAIL PROTECTED]> wrote:
>  > > Dear pvfs2-users,
>  > >
>  > >  I have been trying to get pvfs2 working over infiniband for a few
>  > >  weeks now and have made a lot of progress.  I am still stuck on one
>  > >  last thing I can't seem to fix.
>  > >
>  > >  Basically, everything will be fine for a while (like a few days), then
>  > >  I see the following in one of the pvfs2-server.logs (when the
>  > >  debugging mask is set to "all"):
>  > >
>  > >  [E 03/30 11:50] Error: encourage_recv_incoming: mop_id 680cc0 in RTS_DONE message not found.
>  > >  [E 03/30 11:50]         [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server(error+0xbd) [0x45d9ed]
>  > >  [E 03/30 11:50]         [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x45b571]
>  > >  [E 03/30 11:50]         [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x45d281]
>  > >  [E 03/30 11:50]         [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server(BMI_testcontext+0x120) [0x43cd40]
>  > >  [E 03/30 11:50]         [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x43508d]
>  > >  [E 03/30 11:50]         [bt] /lib64/tls/libpthread.so.0 [0x354b90610a]
>  > >  [E 03/30 11:50]         [bt] /lib64/tls/libc.so.6(__clone+0x73) [0x354b0c68c3]
>  > >
>  > >  At this point all mounts hang and require a restart/remount of all
>  > >  servers and clients, and all jobs using this space need to be
>  > >  restarted.
>  > >
>  > >  Only one server ever suffers this problem at a time; we have 3
>  > >  servers total for I/O (one doing both metadata and I/O), and the
>  > >  message can occur on any of the 3.
>  > >
>  > >  It seems that this occurs only when the number of clients accessing
>  > >  the filesystem gets larger than, say, 15-20, or perhaps it is a
>  > >  filesystem load issue?  I haven't been able to tell...
>  > >
>  > >  I am using the CVS version from 03/23/08 (I have also tried version
>  > >  2.6.3 but this had other problems mentioned in the pvfs2 users mailing
>  > >  list, so I decided to go to the CVS version).
>  > >
>  > >  I am using OFED version 1.1 on a cluster of dual core/processor
>  > >  Opterons running kernel 2.6.9-42.ELsmp.  We have 114 clients which
>  > >  mount the pvfs file space over infiniband and use it as scratch space.
>  > >  They don't use MPI-IO/ROMIO; they just write directly to the pvfs2
>  > >  file space mounted via IB (I guess they write through the kernel
>  > >  interface).  The errors seem to occur when more than 15-20
>  > >  processors' worth of jobs try to read/write to the pvfs scratch
>  > >  space, or they could be just random.
>  > >
>  > >  Does anyone have some clues for how to debug this further or track
>  > >  down what the problem is?
>  > >
>  > >  Any suggestions are welcome.
>  > >
>  > >  Thanks,
>  > >
>  > >  Eric J. Walter
>  > >  Department of Physics
>  > >  College of William and Mary
>  > >
>  > >
>  > >  _______________________________________________
>  > >  Pvfs2-users mailing list
>  > >  [EMAIL PROTECTED]
>  > >  http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>  > >
>  >
>  >
>  >
>  > --
>  > Kyle Schochenmaier
>



-- 
Kyle Schochenmaier
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
