[EMAIL PROTECTED] wrote on Tue, 19 Feb 2008 16:59 -0600:
> Here's a little bit more info..
>
> [E 02/19 17:00] max send/recv sge 29 30
> [E 02/19 17:01] job_time_mgr_expire: job time out: cancelling flow
> operation, job_id: 2437.
> [E 02/19 17:01] fp_multiqueue_cancel: flow proto cancel called on 0x6216c0
> [E 02/19 17:01] handle_io_error: flow proto error cleanup started on
> 0x6216c0, error_code: -1610613121
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 47602573279216 (LWP 4049)]
> memcache_deregister (md=<value optimized out>, buflist=0x664ff8) at
> ../src/io/bmi/bmi_ib/mem.c:317
> 317 --c->count;
> (gdb)
> (gdb) list
> 312
> 313         gen_mutex_lock(&memcache_device->mutex);
> 314         for (i=0; i<buflist->num; i++) {
> 315 #if ENABLE_MEMCACHE
> 316             memcache_entry_t *c = buflist->memcache[i];
> 317             --c->count;
> 318             debug(2,
> 319                 "%s: dec refcount [%d] %p len %lld (via %p len %lld) refcnt now %d",
> 320                 __func__, i, buflist->buf.send[i], lld(buflist->len[i]),
> 321                 c->buf, lld(c->len), c->count);
I'd love to see a backtrace. Then walk up a couple of functions and
take a look at the rq or sq from which the buflist came, and check
rq->c to see whether it was cancelled or otherwise left in a funky state.
If you get back to debugging the WQ overflow issue, let me know.
-- Pete
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers