[EMAIL PROTECTED] wrote on Tue, 12 Feb 2008 17:14 -0600:
> I'm getting a sig11 with the power5 client. Here's a bunch of debugging
> info. Now where do I go next?
>
> [D 16:57:30.033896] BMI_post_sendunexpected_list: addr: 269231512, count: 1, total_size: 52, tag: 15
> [D 16:57:30.033926] element 0: offset: 0x1013d390, size: 52
> [D 16:57:30.033955] post_send: sq 0x100c7f60 len 52 peer da13:3345.
> [D 16:57:30.033984] encourage_send_waiting_buffer: sq 0x100c7f60 sent EAGER len 52.
> [D 16:57:30.034019] ib_check_cq: send to da13:3345 completed locally: sq 0x100c7f60 -> SQ_WAITING_USER_TEST.
> [D 16:57:30.034047] test_sq: sq 0x100c7f60 completed 52 to da13:3345.
> [D 16:57:30.034162] ib_check_cq: recv from da13:3345 len 104 type MSG_EAGER_SEND credit 1.
> [D 16:57:30.034191] encourage_recv_incoming: recv eager len 104.
> [D 16:57:30.034216] encourage_recv_incoming: matched rq 0x100d8790 now RQ_EAGER_WAITING_USER_TEST.
> [D 16:57:30.034246] encourage_recv_incoming: early registration not needed, dereg after eager.
> [D 16:57:30.034276] memcache_deregister: dec refcount [0] 0x10146930 len 8224 (via 0x10146930 len 8224) refcnt now 1.
> [D 16:57:30.034307] test_rq: rq 0x100d8790 completed 88 from da13:3345.
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread -134410208 (LWP 6302)]
> completion_list_retrieve_completed (op_id_array=0xfff7d710,
> user_ptr_array=0xfff7d310, error_code_array=0xfff7d410, limit=64,
> out_count=0xfff7d2f0) at ../src/client/sysint/client-state-machine.c:141
> 141 op_id_array[i] = sm_p->sys_op_id;
> (gdb)
> (gdb)
> (gdb)
> (gdb)
> (gdb) bt
> #0 completion_list_retrieve_completed (op_id_array=0xfff7d710,
> user_ptr_array=0xfff7d310, error_code_array=0xfff7d410, limit=64,
> out_count=0xfff7d2f0) at ../src/client/sysint/client-state-machine.c:141
> #1 0x100441b4 in PINT_client_state_machine_testsome
> (op_id_array=0xfff7d710,
> op_count=0xfff7d2f0, user_ptr_array=0xfff7d310,
> error_code_array=0xfff7d410, timeout_ms=10)
> at ../src/client/sysint/client-state-machine.c:694
> #2 0x10010c00 in process_vfs_requests ()
> at ../src/apps/kernel/linux/pvfs2-client-core.c:2943
> #3 0x100120f4 in main (argc=<value optimized out>, argv=0xfff7dc74)
> at ../src/apps/kernel/linux/pvfs2-client-core.c:3379
> (gdb) print sm_p
> $1 = (PINT_client_sm *) 0x0
> (gdb)
> $2 = (PINT_client_sm *) 0x0
> (gdb) list
> 136 assert(smcb);
> 137
> 138 if (i < limit)
> 139 {
> 140 sm_p = PINT_sm_frame(smcb, PINT_FRAME_CURRENT);
> 141 op_id_array[i] = sm_p->sys_op_id;
> 142 error_code_array[i] = sm_p->error_code;
> 143
> 144 if (user_ptr_array)
> 145 {
> (gdb) print smcb
> No symbol "smcb" in current context.
> (gdb) list -
> 126
> 127 gen_mutex_lock(&s_completion_list_mutex);
> 128 for(i = 0; i < s_completion_list_index; i++)
> 129 {
> 130 if (s_completion_list[i] == NULL)
> 131 {
> 132 continue;
> 133 }
> 134
> 135 smcb = s_completion_list[i];
> (gdb) print s_completion_list[0]
> $3 = (PINT_smcb *) 0x100da450
> (gdb) print *s_completion_list[0]
> $4 = {stackptr = 0, current_state = 0x100b0068, state_stack = {0x100aff90,
> 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, frames = {next = 0x100da478,
> prev = 0x100da478}, frame_count = 1,
> op_get_state_machine = 0x10043b80 <client_op_state_get_machine>, op = 5,
> op_id = 0, parent_smcb = 0x0, op_terminate = 1, op_cancelled = 0,
> children_running = 0, op_completed = 1, context = 0,
> terminate_fn = 0x100452a0 <client_state_machine_terminate>, user_ptr =
> 0x0}

All I get from this is that the frames qlist has a single entry,
state_stack[4]. Not sure how it got so deep into there. Likely
some sort of memory corruption, or we have a fairly major
undiscovered SM bug on our hands.
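
One way to sanity-check the frames list from gdb is to compare its next
and prev pointers with the address of the list head itself. In a
Linux-style circular list, an empty head points back at itself, while a
single entry makes next and prev equal to each other but different from
the head. The sketch below only illustrates that check under the
assumption that qlist follows this convention; struct list_head,
struct frame, and first_frame_payload are hypothetical stand-ins, not
the PVFS2 qlist or PINT_sm_frame code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a Linux-style circular list head. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* Hypothetical frame record; the real PVFS2 layout may differ. */
struct frame {
    struct list_head link;
    void *payload;
};

/* An empty list is a head whose next and prev point at itself. */
static void list_init(struct list_head *head)
{
    assert(head != NULL);
    head->next = head;
    head->prev = head;
}

/* Insert node right after head. */
static void list_add(struct list_head *node, struct list_head *head)
{
    assert(node != NULL && head != NULL);
    node->next = head->next;
    node->prev = head;
    head->next->prev = node;
    head->next = node;
}

/* Walk the ring and count entries, excluding the head itself. */
static size_t list_count(const struct list_head *head)
{
    const struct list_head *p;
    size_t n = 0;

    assert(head != NULL);
    for (p = head->next; p != head; p = p->next)
        n++;
    return n;
}

/* Return the first frame's payload, or NULL when the list is empty.
 * A caller that dereferences the result without checking for NULL
 * would fault in the same way as the sm_p access in the backtrace. */
static void *first_frame_payload(struct list_head *head)
{
    struct frame *f;

    assert(head != NULL);
    if (head->next == head)
        return NULL;
    f = (struct frame *)((char *)head->next - offsetof(struct frame, link));
    return f->payload;
}
```

In gdb the equivalent check would be to print the address of the frames
member of the smcb and compare it with the next and prev values shown
in the structure dump.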

If you can repeat this at will, doing a -g build and running with
all debugging would be especially nice. Maybe the debug log would
show something curious.

The other approach is to run under valgrind and cross fingers it
finds something interesting.

-- Pete

> (gdb) info locals
> i = 0
> new_list_index = 0
> tmp_completion_list = {0x0 <repeats 256 times>}
> sm_p = (PINT_client_sm *) 0x0
> __PRETTY_FUNCTION__ = "completion_list_retrieve_completed"
> (gdb) print op_id_array
> $5 = (PVFS_sys_op_id *) 0xfff7d710
> (gdb) print op_id_array[0]
> $7 = 34
>