[EMAIL PROTECTED] wrote on Tue, 01 Apr 2008 14:31 -0500:
> These are all memfree cards. Hrm, that throws out one of my ideas, that
> this was related to having NICs with memory, which 'tend to have lower
> resource capacities' ...
> So looking back at your log, I saw that you're getting a 'lost mopid',
> which means that we lost a message somewhere along the line. This is
> generally due to a server going out to lunch, or network problems.
> 
> This bug/error is a result of failing an assert in the bmi-ib layer,
> and we've made some modifications to the lines that precede this
> recently. I'm wondering if those are somehow incorrect or need some
> other checks. Pete is our resident expert; I'll see if he has some
> insight.
> 
> We will probably need to know the state of the system when this
> assertion fails, so you'll need to run gdb on your server processes
> and try to break here:
> (ib.c)
>       bmi_ib_assert(rq, "%s: mop_id %llx in RTS_DONE message not found",
>                     __func__, llu(mh_rts_done.mop_id));
> 
> You may find it easier to put in a line above this such as:
> if(!rq)
>    printf("error\n");
> and set a breakpoint on that line, since I can't remember how well you
> can set breakpoints meaningfully around assertions.
> 
> Not sure if debug masks will get us to a solution here, but Pete may
> say differently.
> 
> 
> Pete, I think this is the same error, or at least one of the same
> errors I've stumbled across and not been able to figure out a solution
> for yet. Do you have any ideas/comments?

Sorry I can't really pay much attention to this for the next few
weeks.  I thought this mop_id was a good bug that you found and we
fixed it in the CVS.  Perhaps Eric does not have that fix?

It would be highly interesting to track this down further, if it is
post-fix and somewhat repeatable.  You'll recall the approach we took
before:  full logs on 1 client and 1 server, then I started at the
end and walked backward looking for the mop_id in question to figure
out what went wrong.  Perhaps you can take a shot at this with logs
from Eric (off-list or via http for big files is generally nicer).
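
For the walking-backward step, a small helper along these lines may
save some scrolling.  It is only a sketch and not part of PVFS: the
function name is made up, and it assumes the mop_id shows up in the
log as a plain substring (for instance the hex value the assert
prints with %llx).  It returns the line numbers that mention the
id, newest first.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: scan a log stream for lines containing `needle`
 * and return their 1-based line numbers, last match first, so the
 * reader can start at the failure and walk backward.
 *
 * The caller frees the returned array.  NULL is returned when nothing
 * matched or when allocation failed; *count is 0 in both cases.
 * Caveat: a match split across two 4096-byte chunks of one very long
 * line is missed.
 */
long *find_mentions_newest_first(FILE *fp, const char *needle, size_t *count)
{
    char buf[4096];
    long *hits = NULL;
    size_t n = 0, cap = 0;
    long lineno = 1;

    assert(fp != NULL && needle != NULL && count != NULL);
    *count = 0;

    while (fgets(buf, sizeof buf, fp) != NULL) {
        /* Record each line once, even if it arrives in several chunks. */
        int already = (n > 0 && hits[n - 1] == lineno);

        if (!already && strstr(buf, needle) != NULL) {
            if (n == cap) {
                size_t new_cap = cap ? cap * 2 : 16;
                long *tmp = realloc(hits, new_cap * sizeof *tmp);
                if (tmp == NULL) {
                    free(hits);
                    return NULL;
                }
                hits = tmp;
                cap = new_cap;
            }
            hits[n++] = lineno;
        }
        /* Only a chunk that ends the line advances the line counter. */
        if (strchr(buf, '\n') != NULL)
            lineno++;
    }

    /* Reverse in place so the newest mention comes first. */
    for (size_t i = 0; i < n / 2; i++) {
        long t = hits[i];
        hits[i] = hits[n - 1 - i];
        hits[n - 1 - i] = t;
    }
    *count = n;
    return hits;
}
```

Run it over the server log and the client log separately with the
same id; the first entry on each side is the last thing that side
said about the operation before it went missing.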

The outstanding issue with IB is lots of clients hitting a server
and (guessing) RDMA operations chewing up too many WRs.  You and
Troy were working on a solution for that, or at least tracking so
we know where we are going wrong with the NIC resource management.
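
The kind of tracking meant here could be as simple as a counter
around every post and completion.  The sketch below is illustrative
only: struct wr_budget and its functions are invented names, not
anything in the bmi-ib code, and the limit is whatever queue depth
the QP was actually created with.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical accounting of outstanding work requests on one queue. */
struct wr_budget {
    int limit;             /* assumed maximum outstanding WRs */
    int outstanding;       /* posted but not yet completed */
    int high_water;        /* largest value of outstanding seen */
    unsigned long refused; /* posts turned away for lack of room */
};

void wr_budget_init(struct wr_budget *b, int limit)
{
    assert(b != NULL && limit > 0);
    b->limit = limit;
    b->outstanding = 0;
    b->high_water = 0;
    b->refused = 0;
}

/* Call before posting n WRs.  Returns 1 if they fit, 0 if not. */
int wr_budget_take(struct wr_budget *b, int n)
{
    assert(b != NULL && n > 0);
    if (b->outstanding + n > b->limit) {
        b->refused++;
        return 0;
    }
    b->outstanding += n;
    if (b->outstanding > b->high_water)
        b->high_water = b->outstanding;
    return 1;
}

/* Call when n completions have been reaped. */
void wr_budget_release(struct wr_budget *b, int n)
{
    assert(b != NULL && n > 0 && n <= b->outstanding);
    b->outstanding -= n;
}

/* Dump the counters, for example from a debug log line. */
void wr_budget_report(const struct wr_budget *b, FILE *out)
{
    assert(b != NULL && out != NULL);
    fprintf(out, "WRs: %d/%d outstanding, high water %d, %lu refused\n",
            b->outstanding, b->limit, b->high_water, b->refused);
}
```

A high-water mark sitting at the limit, or a nonzero refused count,
on a server with many clients would support the guess that RDMA
operations are what exhaust the queue.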

Parting shot.  If any of this happens _after_ a timeout failure, all
bets are off.  We have had numerous issues with cancel.  There could
be bugs in the ib method or upwards through the stack that cause
breakage in that case.

                -- Pete
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
