Pete Wyckoff wrote:
> [EMAIL PROTECTED] wrote on Fri, 15 Sep 2006 09:58 -0500:
> > I've been trying to debug some issues with my MD server going down, or
> > rather timing out and closing the connections for some reason, and
> > canceling bmi jobs. While doing so, I ran into a segfaulting issue in
> > openib_close_connection:
> This job cancel scenario is tricky. I plan to look at this soon, to
> figure out all the possible cases where things can hang up, but it
> won't be trivial.
> > static void openib_close_connection(ib_connection_t *c)
> > {
> >     int ret;
> >     struct openib_connection_priv *oc = c->priv;
> >
> >     /* destroy the queue pairs */
> >     <snip>
> >     free(oc);
> > }
> > Since my gdb backtrace doesn't go into any ibv_* functions, I'm
> > assuming this free() call is the culprit.
> > I'm not sure why this free() would segfault, but until we can work
> > out why the connections are being closed, it may be a good idea to
> > add a check there to make sure oc is still valid.
> > Has anyone run into this, or other issues with servers going down
> > in openib?
> Something stomped on the malloc arena. oc should be valid as
> mediated by the c->refcnt higher up. I'm not sure how to test
> for validity. There should be some interesting log messages,
> though, like
>
>     debug(2, "%s: closing connection to %s", __func__, c->peername);
>
> or
>
>     debug(1, "%s: refcnt non-zero %d, delaying free", __func__, c->refcnt);
>
> that may help us figure this out if you turn on network
> logging. (Please, more logs!)
> -- Pete
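The refcnt-mediated lifetime Pete describes might look roughly like the sketch below. The type and function names (conn_t, conn_get, conn_put) are invented for illustration; the real ib_connection_t in the openib driver carries far more state.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for ib_connection_t; names are illustrative only. */
typedef struct {
    int refcnt;
    void *priv;   /* per-driver state, e.g. struct openib_connection_priv */
} conn_t;

static conn_t *conn_new(void)
{
    conn_t *c = calloc(1, sizeof(*c));
    assert(c != NULL);
    c->refcnt = 1;
    c->priv = malloc(64);   /* placeholder for the driver's priv block */
    return c;
}

/* Take another reference to keep the connection alive. */
static void conn_get(conn_t *c)
{
    ++c->refcnt;
}

/* Drop a reference; only the last holder actually frees -- the
 * "refcnt non-zero, delaying free" case in the debug message above.
 * Returns 1 if the connection was freed, 0 if the free was delayed. */
static int conn_put(conn_t *c)
{
    if (--c->refcnt > 0)
        return 0;           /* someone still holds it: delay the free */
    free(c->priv);
    free(c);
    return 1;
}
```

A scheme like this only protects free(oc) if every path that can reach the close actually holds a reference; a cancel path that skips conn_get() would still free underneath a live user, which would match the "stomped malloc arena" symptom.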
Doh, I thought I had a log included here -- I guess not. Here's the log;
note the second "closing connection" message:
[D 10:33:03.279050] memcache_register: hit [0] 0x2aaaaadc3010 len 262144 (via 0x2aaaaadc3010 len 262144) refcnt now 2.
[D 10:33:03.279060] BMI_post_send: addr: 5115, offset: 0x2aaaaae04010, size: 262144, tag: 14
[D 10:33:03.279071] memcache_register: hit [0] 0x2aaaaae04010 len 262144 (via 0x2aaaaae04010 len 262144) refcnt now 2.
[D 10:33:03.279081] BMI_post_send: addr: 5115, offset: 0x2aaaaae45010, size: 131072, tag: 14
[D 10:33:03.279092] memcache_register: hit [0] 0x2aaaaae45010 len 131072 (via 0x2aaaaae45010 len 262144) refcnt now 2.
[E 10:33:33.839001] Job time out: cancelling flow operation, job_id: 5152.
[E 10:33:33.839028] Flow proto cancel called on 0x2aaaaac00920
[E 10:33:33.839041] Flow proto error cleanup started on 0x2aaaaac00920, error_code: -1610612737
[D 10:33:33.839067] BMI_cancel: cancel id 5154
[D 10:33:33.839127] memcache_deregister: dec refcount [0] 0x2aaaaab91010 len 262144 count now 1.
[D 10:33:33.839146] memcache_deregister: dec refcount [0] 0x2aaaaad41010 len 262144 count now 1.
[D 10:33:33.839154] memcache_deregister: dec refcount [0] 0x2aaaaad82010 len 262144 count now 1.
[D 10:33:33.839162] memcache_deregister: dec refcount [0] 0x2aaaaadc3010 len 262144 count now 1.
[D 10:33:33.839170] memcache_deregister: dec refcount [0] 0x2aaaaae04010 len 262144 count now 1.
[D 10:33:33.839178] memcache_deregister: dec refcount [0] 0x2aaaaae45010 len 131072 count now 1.
[D 10:33:33.839189] BMI_cancel: cancel id 5155
[D 10:33:33.839200] BMI_cancel: cancel id 5158
[D 10:33:33.839210] BMI_cancel: cancel id 5159
[D 10:33:33.839221] BMI_cancel: cancel id 5157
[D 10:33:33.839232] BMI_cancel: cancel id 5156
[D 10:33:33.839242] BMI_cancel: cancel id 5153
[E 10:33:33.839253] Flow proto 0x2aaaaac00920 canceling a total of 7 BMI or Trove operations
[D 10:33:33.839260] test_sq: sq 0x603e60 cancelled.
[D 10:33:33.839282] test_sq: sq 0x603ef0 cancelled.
[D 10:33:33.839291] test_sq: sq 0x604100 cancelled.
[D 10:33:33.839299] test_sq: sq 0x6044c0 cancelled.
[D 10:33:33.839308] test_sq: sq 0x60bb50 cancelled.
[D 10:33:33.839319] BMI_testcontext completing: 5153
[D 10:33:33.839327] BMI_testcontext completing: 5154
[D 10:33:33.839334] BMI_testcontext completing: 5155
[D 10:33:33.839342] BMI_testcontext completing: 5156
[D 10:33:33.839349] BMI_testcontext completing: 5157
[D 10:33:33.839365] test_sq: sq 0x60bcc0 cancelled.
[D 10:33:33.839374] test_sq: sq 0x60be30 cancelled.
[D 10:33:33.839382] ib_close_connection: closing connection to 10.1.5.218:59985.
[D 10:33:33.839963] ib_close_connection: closing connection to 0>`.
I'm not sure why we're getting two "closing connection to" messages --
indeed, now that I think of it, that may be our problem: someone may be
trying to close the connection twice. I can tell you that I always see
two of these messages. (Note the garbage peername in the second one.)
I'm also still seeing weird error codes, and I'm not sure if we've
addressed those yet, but I doubt they're our real problem.
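If something really is closing the connection twice, one defensive pattern is to clear the priv pointer on the first close so a second close becomes a no-op instead of a double free. This is a minimal sketch with made-up names (conn_t, close_connection), not the actual BMI code:

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in; the real types are ib_connection_t and
 * struct openib_connection_priv in the openib driver. */
typedef struct {
    void *priv;
} conn_t;

/* Close once, safely: a second call sees priv == NULL and returns.
 * This masks the double close rather than fixing it, but it turns a
 * segfault into something survivable while the real cause is found. */
static void close_connection(conn_t *c)
{
    if (c->priv == NULL)
        return;             /* already closed: do nothing */
    /* ... destroy queue pairs, etc. ... */
    free(c->priv);
    c->priv = NULL;         /* make a repeat close harmless */
}
```

Setting the pointer to NULL also makes a use-after-close fail loudly (NULL dereference) instead of corrupting the malloc arena, which would make the underlying bug easier to catch.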
hope this helps,
-- Kyle
--
Kyle Schochenmaier
[EMAIL PROTECTED]
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy
Scalable Computing Laboratory
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers