[EMAIL PROTECTED] wrote on Fri, 15 Sep 2006 09:58 -0500:
> I've been trying to debug some issues with my MD server going down, or
> rather timing out and closing the connections for some reason, and
> canceling bmi jobs. While doing so, I ran into a segfaulting issue in
> openib_close_connection:
This job cancel scenario is tricky. I plan to look at this soon, to
figure out all the possible cases where things can hang up, but it
won't be trivial.
> static void openib_close_connection(ib_connection_t *c)
> {
>     int ret;
>     struct openib_connection_priv *oc = c->priv;
>
>     /* destroy the queue pairs */
>
>     <snip>
>
>     free(oc);
> }
>
> Since my gdb backtrace doesn't go into any ibv_* functions, I'm assuming
> this free() call is the culprit.
> I'm not sure why this free() would segfault, but until we can work out
> why it's closing the connections, I'm thinking it may be a good idea
> to put a check in there to make sure oc is still valid.
>
> Has anyone run into this or other issues with servers going down in openib?
A segfault inside free() usually means something stomped on the
malloc arena earlier, so the real bug is probably elsewhere. oc
itself should be valid, as mediated by the c->refcnt higher up, and
I'm not sure how to test a pointer for validity beyond a NULL check.
There should be some interesting log messages though, like

    debug(2, "%s: closing connection to %s", __func__, c->peername);

or

    debug(1, "%s: refcnt non-zero %d, delaying free", __func__, c->refcnt);

that may help us figure this out if you had network logging turned
on. (Please, more logs!)
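
To make the refcnt point concrete, here is a minimal sketch of a close that is gated on a reference count, assuming a much simplified connection struct. The names conn_t, conn_priv, conn_close, conn_put and the closed flag are hypothetical stand-ins, not the actual BMI code, and fprintf stands in for debug(). Note that setting priv to NULL only protects against a repeated close through the same struct; it cannot detect a corrupted heap or a stale copy of the pointer held somewhere else.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the real structures; only the fields
 * mentioned in this thread are modeled. */
struct conn_priv {
    int placeholder;            /* stands in for queue pair state */
};

typedef struct {
    const char *peername;
    int refcnt;                 /* outstanding users of this connection */
    int closed;                 /* set once a close has been requested */
    struct conn_priv *priv;
} conn_t;

/* Free the private data only when nobody holds a reference.
 * Returns 1 if the close ran, 0 if it was delayed. */
int conn_close(conn_t *c)
{
    assert(c != NULL);
    if (c->refcnt != 0) {
        fprintf(stderr, "%s: refcnt non-zero %d, delaying free\n",
                __func__, c->refcnt);
        c->closed = 1;          /* remember to finish in conn_put() */
        return 0;
    }
    fprintf(stderr, "%s: closing connection to %s\n",
            __func__, c->peername);
    free(c->priv);              /* free(NULL) is a no-op */
    c->priv = NULL;             /* makes a repeated close harmless */
    return 1;
}

/* Drop one reference; finish a delayed close when the last one goes.
 * Returns 1 if this call performed the close, 0 otherwise. */
int conn_put(conn_t *c)
{
    assert(c != NULL && c->refcnt > 0);
    --c->refcnt;
    if (c->refcnt == 0 && c->closed)
        return conn_close(c);
    return 0;
}
```

With one reference outstanding, conn_close() only logs and marks the connection, and the matching conn_put() performs the actual free. If a segfault still shows up in free() with a scheme like this in place, that points back at memory corruption somewhere else.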
-- Pete