Pete Wyckoff wrote:
> [EMAIL PROTECTED] wrote on Fri, 15 Sep 2006 09:58 -0500:
> > I've been trying to debug some issues with my MD server going down, or
> > rather timing out and closing the connections for some reason, and
> > canceling bmi jobs. While doing so, I ran into a segfaulting issue in
> > openib_close_connection:
> This job cancel scenario is tricky. I plan to look at this soon, to
> figure out all the possible cases where things can hang up, but it
> won't be trivial.
> > static void openib_close_connection(ib_connection_t *c)
> > {
> >     int ret;
> >     struct openib_connection_priv *oc = c->priv;
> >
> >     /* destroy the queue pairs */
> >     <snip>
> >     free(oc);
> > }
> > Since my gdb backtrace doesn't go into any ibv_* functions, I'm
> > assuming this free() call is the culprit.
> > I'm not sure why this free() would segfault, but until we can work
> > out why the connections are being closed, it may be a good idea to
> > add a check there to make sure oc is still valid.
> > Has anyone run into this, or other issues with servers going down
> > in openib?
> Something stomped on the malloc arena. oc should be valid as
> mediated by the c->refcnt higher up. I'm not sure how to test
> for validity. There should be some interesting log messages,
> though, like
>
>     debug(2, "%s: closing connection to %s", __func__, c->peername);
>
> or
>
>     debug(1, "%s: refcnt non-zero %d, delaying free", __func__, c->refcnt);
>
> that may help us figure this out if you turn on network
> logging. (Please, more logs!)
> -- Pete
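The refcnt-mediated lifetime Pete describes might look roughly like the sketch below. The type and function names (conn_t, conn_get, conn_put) are invented for illustration; the real ib_connection_t in the openib driver carries far more state.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for ib_connection_t; names are illustrative only. */
typedef struct {
    int refcnt;
    void *priv;   /* per-driver state, e.g. struct openib_connection_priv */
} conn_t;

static conn_t *conn_new(void)
{
    conn_t *c = calloc(1, sizeof(*c));
    assert(c != NULL);
    c->refcnt = 1;
    c->priv = malloc(64);   /* placeholder for the driver's priv block */
    return c;
}

/* Take another reference to keep the connection alive. */
static void conn_get(conn_t *c)
{
    ++c->refcnt;
}

/* Drop a reference; only the last holder actually frees -- the
 * "refcnt non-zero, delaying free" case in the debug message above.
 * Returns 1 if the connection was freed, 0 if the free was delayed. */
static int conn_put(conn_t *c)
{
    if (--c->refcnt > 0)
        return 0;           /* someone still holds it: delay the free */
    free(c->priv);
    free(c);
    return 1;
}
```

A scheme like this only protects free(oc) if every path that can reach the close actually holds a reference; a cancel path that skips conn_get() would still free underneath a live user, which would match the "stomped malloc arena" symptom.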
Doh, I thought I had a log included here -- I guess not. Here's the log;
note the second "closing connection" message:
[D 10:33:03.279050] memcache_register: hit [0] 0x2aaaaadc3010 len 262144 (via 0x2aaaaadc3010 len 262144) refcnt now 2.
[D 10:33:03.279060] BMI_post_send: addr: 5115, offset: 0x2aaaaae04010, size: 262144, tag: 14
[D 10:33:03.279071] memcache_register: hit [0] 0x2aaaaae04010 len 262144 (via 0x2aaaaae04010 len 262144) refcnt now 2.
[D 10:33:03.279081] BMI_post_send: addr: 5115, offset: 0x2aaaaae45010, size: 131072, tag: 14
[D 10:33:03.279092] memcache_register: hit [0] 0x2aaaaae45010 len 131072 (via 0x2aaaaae45010 len 262144) refcnt now 2.
[E 10:33:33.839001] Job time out: cancelling flow operation, job_id: 5152.
[E 10:33:33.839028] Flow proto cancel called on 0x2aaaaac00920
[E 10:33:33.839041] Flow proto error cleanup started on 0x2aaaaac00920, error_code: -1610612737
[D 10:33:33.839067] BMI_cancel: cancel id 5154
[D 10:33:33.839127] memcache_deregister: dec refcount [0] 0x2aaaaab91010 len 262144 count now 1.
[D 10:33:33.839146] memcache_deregister: dec refcount [0] 0x2aaaaad41010 len 262144 count now 1.
[D 10:33:33.839154] memcache_deregister: dec refcount [0] 0x2aaaaad82010 len 262144 count now 1.
[D 10:33:33.839162] memcache_deregister: dec refcount [0] 0x2aaaaadc3010 len 262144 count now 1.
[D 10:33:33.839170] memcache_deregister: dec refcount [0] 0x2aaaaae04010 len 262144 count now 1.
[D 10:33:33.839178] memcache_deregister: dec refcount [0] 0x2aaaaae45010 len 131072 count now 1.
[D 10:33:33.839189] BMI_cancel: cancel id 5155
[D 10:33:33.839200] BMI_cancel: cancel id 5158
[D 10:33:33.839210] BMI_cancel: cancel id 5159
[D 10:33:33.839221] BMI_cancel: cancel id 5157
[D 10:33:33.839232] BMI_cancel: cancel id 5156
[D 10:33:33.839242] BMI_cancel: cancel id 5153
[E 10:33:33.839253] Flow proto 0x2aaaaac00920 canceling a total of 7 BMI or Trove operations
[D 10:33:33.839260] test_sq: sq 0x603e60 cancelled.
[D 10:33:33.839282] test_sq: sq 0x603ef0 cancelled.
[D 10:33:33.839291] test_sq: sq 0x604100 cancelled.
[D 10:33:33.839299] test_sq: sq 0x6044c0 cancelled.
[D 10:33:33.839308] test_sq: sq 0x60bb50 cancelled.
[D 10:33:33.839319] BMI_testcontext completing: 5153
[D 10:33:33.839327] BMI_testcontext completing: 5154
[D 10:33:33.839334] BMI_testcontext completing: 5155
[D 10:33:33.839342] BMI_testcontext completing: 5156
[D 10:33:33.839349] BMI_testcontext completing: 5157
[D 10:33:33.839365] test_sq: sq 0x60bcc0 cancelled.
[D 10:33:33.839374] test_sq: sq 0x60be30 cancelled.
[D 10:33:33.839382] ib_close_connection: closing connection to 10.1.5.218:59985.
[D 10:33:33.839963] ib_close_connection: closing connection to 0>`.
I'm not sure why we're getting two "closing connection to" messages --
indeed, now that I think of it, that may be our problem: someone may be
trying to close the connection twice. I can tell you that I always see
two of these messages. (Note the garbage peername in the second one.)
I'm also still seeing weird error codes, and I'm not sure if we've
addressed those yet, but I doubt they're our real problem.
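If something really is closing the connection twice, one defensive pattern is to clear the priv pointer on the first close so a second close becomes a no-op instead of a double free. This is a minimal sketch with made-up names (conn_t, close_connection), not the actual BMI code:

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in; the real types are ib_connection_t and
 * struct openib_connection_priv in the openib driver. */
typedef struct {
    void *priv;
} conn_t;

/* Close once, safely: a second call sees priv == NULL and returns.
 * This masks the double close rather than fixing it, but it turns a
 * segfault into something survivable while the real cause is found. */
static void close_connection(conn_t *c)
{
    if (c->priv == NULL)
        return;             /* already closed: do nothing */
    /* ... destroy queue pairs, etc. ... */
    free(c->priv);
    c->priv = NULL;         /* make a repeat close harmless */
}
```

Setting the pointer to NULL also makes a use-after-close fail loudly (NULL dereference) instead of corrupting the malloc arena, which would make the underlying bug easier to catch.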
hope this helps,
-- Kyle
--
Kyle Schochenmaier
[EMAIL PROTECTED]
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy
Scalable Computing Laboratory
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers