Here's another one:

http://www.scl.ameslab.gov/~troy/pvfs/pvfs2-client.log-a5-n5-abort

When I run pvfs2-client-core with no arguments, it seems to work fine.

Could you also take a look at:

http://www.scl.ameslab.gov/~troy/pvfs/hangs/

These are all instances of pvfs2-client-core hanging on PPC64 while I'm running dd on a large file.

Pete Wyckoff wrote:
[EMAIL PROTECTED] wrote on Tue, 12 Feb 2008 17:14 -0600:
I'm getting a sig11 with the power5 client. Here's a bunch of debugging info. Now where do I go next?

[D 16:57:30.033896] BMI_post_sendunexpected_list: addr: 269231512, count: 1, total_size: 52, tag: 15
[D 16:57:30.033926]    element 0: offset: 0x1013d390, size: 52
[D 16:57:30.033955] post_send: sq 0x100c7f60 len 52 peer da13:3345.
[D 16:57:30.033984] encourage_send_waiting_buffer: sq 0x100c7f60 sent EAGER len 52.
[D 16:57:30.034019] ib_check_cq: send to da13:3345 completed locally: sq 0x100c7f60 -> SQ_WAITING_USER_TEST.
[D 16:57:30.034047] test_sq: sq 0x100c7f60 completed 52 to da13:3345.
[D 16:57:30.034162] ib_check_cq: recv from da13:3345 len 104 type MSG_EAGER_SEND credit 1.
[D 16:57:30.034191] encourage_recv_incoming: recv eager len 104.
[D 16:57:30.034216] encourage_recv_incoming: matched rq 0x100d8790 now RQ_EAGER_WAITING_USER_TEST.
[D 16:57:30.034246] encourage_recv_incoming: early registration not needed, dereg after eager.
[D 16:57:30.034276] memcache_deregister: dec refcount [0] 0x10146930 len 8224 (via 0x10146930 len 8224) refcnt now 1.
[D 16:57:30.034307] test_rq: rq 0x100d8790 completed 88 from da13:3345.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -134410208 (LWP 6302)]
completion_list_retrieve_completed (op_id_array=0xfff7d710,
   user_ptr_array=0xfff7d310, error_code_array=0xfff7d410, limit=64,
   out_count=0xfff7d2f0) at ../src/client/sysint/client-state-machine.c:141
141                 op_id_array[i] = sm_p->sys_op_id;
(gdb)
(gdb)
(gdb)
(gdb)
(gdb) bt
#0  completion_list_retrieve_completed (op_id_array=0xfff7d710,
   user_ptr_array=0xfff7d310, error_code_array=0xfff7d410, limit=64,
   out_count=0xfff7d2f0) at ../src/client/sysint/client-state-machine.c:141
#1 0x100441b4 in PINT_client_state_machine_testsome (op_id_array=0xfff7d710,
   op_count=0xfff7d2f0, user_ptr_array=0xfff7d310,
   error_code_array=0xfff7d410, timeout_ms=10)
   at ../src/client/sysint/client-state-machine.c:694
#2  0x10010c00 in process_vfs_requests ()
   at ../src/apps/kernel/linux/pvfs2-client-core.c:2943
#3  0x100120f4 in main (argc=<value optimized out>, argv=0xfff7dc74)
   at ../src/apps/kernel/linux/pvfs2-client-core.c:3379
(gdb) print sm_p
$1 = (PINT_client_sm *) 0x0
(gdb)
$2 = (PINT_client_sm *) 0x0
(gdb) list
136             assert(smcb);
137
138             if (i < limit)
139             {
140                 sm_p = PINT_sm_frame(smcb, PINT_FRAME_CURRENT);
141                 op_id_array[i] = sm_p->sys_op_id;
142                 error_code_array[i] = sm_p->error_code;
143
144                 if (user_ptr_array)
145                 {
(gdb) print smcb
No symbol "smcb" in current context.
(gdb) list -
126
127         gen_mutex_lock(&s_completion_list_mutex);
128         for(i = 0; i < s_completion_list_index; i++)
129         {
130             if (s_completion_list[i] == NULL)
131             {
132                 continue;
133             }
134
135             smcb = s_completion_list[i];
(gdb) print s_completion_list[0]
$3 = (PINT_smcb *) 0x100da450
(gdb) print *s_completion_list[0]
$4 = {stackptr = 0, current_state = 0x100b0068, state_stack = {0x100aff90,
   0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, frames = {next = 0x100da478,
   prev = 0x100da478}, frame_count = 1,
 op_get_state_machine = 0x10043b80 <client_op_state_get_machine>, op = 5,
 op_id = 0, parent_smcb = 0x0, op_terminate = 1, op_cancelled = 0,
 children_running = 0, op_completed = 1, context = 0,
terminate_fn = 0x100452a0 <client_state_machine_terminate>, user_ptr = 0x0}

All I get from this is that the frames qlist has a single entry,
state_stack[4].  Not sure how it got so deep into there.  Likely
some sort of memory corruption, or we have a fairly major
undiscovered SM bug on our hands.
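
One way to double-check a reading of the frames field printed above: in a circular doubly linked list, next == prev alone only says the list holds at most one entry. Comparing next against the address of the head itself tells an empty list from a single-entry one. The sketch below assumes a Linux-style list in which an empty head points at itself; the struct and function names are illustrative stand-ins, not the PVFS2 qlist API.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for a circular doubly linked list head. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* An empty list is a head whose next and prev point back at itself. */
void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Link entry in just before head, i.e. at the tail of the list. */
void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Returns 0 for an empty list, 1 for a single entry, 2 for two or more. */
int list_occupancy(const struct list_head *head)
{
    if (head->next == head)
        return 0;               /* head points at itself: empty */
    if (head->next == head->prev)
        return 1;               /* both ends reach the same entry */
    return 2;
}
```

In gdb the same comparison can be made by printing the address of the frames member and comparing it with frames.next.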

If you can repeat this at will, doing a -g build and running with
all debugging would be especially nice.  Maybe the debug log would
show something curious.

The other approach is to run under valgrind and cross fingers it
finds something interesting.

                -- Pete

(gdb) info locals
i = 0
new_list_index = 0
tmp_completion_list = {0x0 <repeats 256 times>}
sm_p = (PINT_client_sm *) 0x0
__PRETTY_FUNCTION__ = "completion_list_retrieve_completed"
(gdb) print op_id_array
$5 = (PVFS_sys_op_id *) 0xfff7d710
(gdb) print op_id_array[0]
$7 = 34
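
The session above shows sm_p coming back NULL from PINT_sm_frame() and then being dereferenced at client-state-machine.c:141. While the root cause is being tracked down, a guard of roughly this shape would turn the crash into a countable, loggable skip. This is only a minimal sketch: the demo_ types and functions are hypothetical stand-ins, not the real PINT_smcb or PINT_client_sm definitions, and it does not fix whatever leaves the frame missing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for PINT_client_sm and PINT_smcb. */
struct demo_frame {
    long sys_op_id;
    int error_code;
};

struct demo_smcb {
    struct demo_frame *frame;   /* NULL models a control block with no frame */
};

/* Stand-in for a frame lookup that can fail and return NULL. */
struct demo_frame *demo_get_frame(const struct demo_smcb *smcb)
{
    return smcb ? smcb->frame : NULL;
}

/*
 * Copy results of completed operations into the output arrays.
 * Entries whose frame lookup fails are counted in *skipped instead of
 * being dereferenced. Returns the number of results copied.
 */
int demo_retrieve_completed(struct demo_smcb *const *list, int count,
                            long *op_ids, int *errors, int limit,
                            int *skipped)
{
    int out = 0;

    *skipped = 0;
    for (int i = 0; i < count && out < limit; i++) {
        if (list[i] == NULL)
            continue;           /* empty slot, same as the original loop */

        struct demo_frame *f = demo_get_frame(list[i]);
        if (f == NULL) {
            (*skipped)++;       /* would have been the SIGSEGV */
            continue;
        }
        op_ids[out] = f->sys_op_id;
        errors[out] = f->error_code;
        out++;
    }
    return out;
}
```

A non-zero skipped count would be the point to dump the offending control block to the debug log.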


_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
