[EMAIL PROTECTED] wrote on Fri, 15 Sep 2006 09:58 -0500:
> I've been trying to debug some issues with my MD server going down, or 
> rather timing out and closing the connections for some reason, and 
> canceling bmi jobs.  While doing so, I ran into a segfaulting issue in 
> openib_close_connection:

This job cancel scenario is tricky.  I plan to look at this soon, to
figure out all the possible cases where things can hang up, but it
won't be trivial.

> static void openib_close_connection(ib_connection_t *c)
> {
>    int ret;
>    struct openib_connection_priv *oc = c->priv;
> 
>    /* destroy the queue pairs */
> 
> <snip>
> 
>    free(oc);
> }
> 
> Since my gdb backtrace doesn't go into any ibv_* functions, I'm assuming 
> this free() call is the culprit.
> I'm not sure why this free() would segfault, but until we can work out 
> why it's closing the connections, it may be a good idea to put a check 
> in there to make sure oc is still valid.
> 
> Has anyone run into this or other issues with servers going down in openib?

If free() itself is crashing, something stomped on the malloc arena.
oc should still be valid, since its lifetime is mediated by the
c->refcnt higher up.  I'm not sure how to test a pointer for
validity at that point.  There should be some interesting log
messages though, like

    debug(2, "%s: closing connection to %s", __func__, c->peername);

or 

    debug(1, "%s: refcnt non-zero %d, delaying free", __func__, c->refcnt);

that may help us to figure this out if you turned on network
logging.  (Please, more logs!)
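
As a stopgap while the real cause is tracked down, a close path can at
least refuse to free twice.  The following is only a sketch: the struct
definitions are hypothetical stand-ins (the real ones in the openib
code have more fields), close_connection_checked() is an invented name,
and fprintf() stands in for debug().  It cannot detect a corrupted
heap; it only catches a repeated close or a close while references
remain.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the real definitions. */
struct openib_connection_priv {
    int placeholder;
};

typedef struct {
    int refcnt;            /* outstanding references to this connection */
    const char *peername;  /* used only for log messages */
    void *priv;            /* device-specific state, NULL once freed */
} ib_connection_t;

/*
 * Free the private state only when no references remain and it has
 * not already been freed.  Returns 1 if priv was freed, 0 otherwise.
 */
static int close_connection_checked(ib_connection_t *c)
{
    struct openib_connection_priv *oc;

    assert(c != NULL);
    oc = c->priv;

    if (c->refcnt != 0) {
        fprintf(stderr, "%s: refcnt non-zero %d, delaying free\n",
                __func__, c->refcnt);
        return 0;
    }
    if (oc == NULL) {
        fprintf(stderr, "%s: priv already freed for %s\n",
                __func__, c->peername);
        return 0;
    }

    fprintf(stderr, "%s: closing connection to %s\n",
            __func__, c->peername);
    free(oc);
    c->priv = NULL;  /* a second close now hits the NULL check above */
    return 1;
}
```

If the "already freed" message ever shows up in the logs, the bug is a
double close; if it never does and free() still crashes, the arena was
most likely corrupted by some other write.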

                -- Pete
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
