[EMAIL PROTECTED] wrote on Tue, 19 Feb 2008 16:59 -0600:
> Here's a little bit more info..
>
> [E 02/19 17:00] max send/recv sge 29 30
> [E 02/19 17:01] job_time_mgr_expire: job time out: cancelling flow
> operation, job_id: 2437.
> [E 02/19 17:01] fp_multiqueue_cancel: flow proto cancel called on 0x6216c0
> [E 02/19 17:01] handle_io_error: flow proto error cleanup started on
> 0x6216c0, error_code: -1610613121
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 47602573279216 (LWP 4049)]
> memcache_deregister (md=<value optimized out>, buflist=0x664ff8) at
> ../src/io/bmi/bmi_ib/mem.c:317
> 317 --c->count;
> (gdb)
> (gdb) list
> 312
> 313         gen_mutex_lock(&memcache_device->mutex);
> 314         for (i=0; i<buflist->num; i++) {
> 315 #if ENABLE_MEMCACHE
> 316             memcache_entry_t *c = buflist->memcache[i];
> 317             --c->count;
> 318             debug(2,
> 319                 "%s: dec refcount [%d] %p len %lld (via %p len %lld) refcnt now %d",
> 320                 __func__, i, buflist->buf.send[i], lld(buflist->len[i]),
> 321                 c->buf, lld(c->len), c->count);
I'd love to see a backtrace. Then walk up a couple of functions and
take a look at the rq or sq from which the buflist came, and check
rq->c to see whether it was cancelled or otherwise left in a funky state.
If you get back to debugging the WQ overflow issue, let me know.
-- Pete
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers