Troy,

Could you also send the stack trace from gdb where the segfault occurs? That will be the most useful information for us.

Thanks,
-sam

On Feb 13, 2008, at 4:24 PM, Troy Benjegerdes wrote:

http://www.scl.ameslab.gov/~troy/pvfs/pvfs2-client-log-crash
or
http://www.scl.ameslab.gov/~troy/pvfs/pvfs2-client-log-crash.gz

This looks pretty bad:

[D 16:10:47.741903] [SM Entering]: (0x10103cf0) sysdev_unexp_sm:post (status: 0)
[D 16:10:47.741929] [SM Exiting]: (0x10103cf0) sysdev_unexp_sm:post (error code: 0), (action: DEFERRED)
[D 16:10:47.741953] Posted PVFS_DEV_UNEXPECTED (375) (waiting for test)(-1073741839)
[D 16:10:47.741977] [-] reposted unexp req [0x10117fc0] due to inlined completion
[D 16:10:47.742001] FRAME GET smcb 0x10119bd8 index 0 -> frame: NULL

Maybe we should have an assert in this code?

More info from a gdb trace:

(gdb) print smcb
$1 = (PINT_smcb *) 0x10119bd8
(gdb) print *smcb
$2 = {stackptr = 0, current_state = 0x100ea3e4, state_stack = {0x100ea3c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, frames = {next = 0x10119c00, prev = 0x10119c00}, frame_count = 1, op_get_state_machine = 0x1005cce0 <client_op_state_get_machine>, op = 5, op_id = 0, parent_smcb = 0x0, op_terminate = 1, op_cancelled = 0, children_running = 0, op_completed = 1, context = 0, terminate_fn = 0x1005ce74 <client_state_machine_terminate>, user_ptr = 0x0}
(gdb) print limit
$3 = 64
(gdb) print i
$4 = 0
(gdb) list PINT_sm_frame
586      * Params: pointer to smcb, stack index
587      * Returns: pointer to frame
588      * Synopsis: returns a frame off of the frame stack
589      */
590     void *PINT_sm_frame(struct PINT_smcb *smcb, int index)
591     {
592         struct PINT_frame_s *frame_entry;
593         struct qlist_head *next;
594
595         if(qlist_empty(&smcb->frames))
(gdb)
596         {
597             gossip_debug(GOSSIP_STATE_MACHINE_DEBUG,
598                          "FRAME GET smcb %p index %d -> frame: NULL\n",
599                          smcb, index);
600             return NULL;
601         }
602         else
603         {
604             int i = 0;
605
(gdb)
606             next = smcb->frames.next;
607             while(i < index)
608             {
609                 next = next->next;
610             }
611             frame_entry = qlist_entry(next, struct PINT_frame_s, link);
612             return frame_entry->frame;
613         }
614     }
615
(gdb) print smcb->frames
$5 = {next = 0x10119c00, prev = 0x10119c00}
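In the listing above, `i` is never incremented inside the `while` loop, so a call with `index > 0` on a non-empty list would never terminate. (With `index == 0`, as in this crash, the loop body does not run at all.) Picking up the assert idea, a minimal standalone sketch of the same walk that advances `i` and asserts it never wraps back onto the list head might look like this. The qlist helpers and struct layouts are simplified stand-ins written for the example, not the real PVFS2 headers, and `sm_frame_at` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the PVFS2 qlist (circular, doubly linked). */
struct qlist_head {
    struct qlist_head *next, *prev;
};

#define qlist_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* An empty list is a head that points at itself. */
void qlist_init_head(struct qlist_head *head)
{
    head->next = head->prev = head;
}

/* Append 'entry' just before 'head', i.e. at the end of the list. */
void qlist_add_tail(struct qlist_head *entry, struct qlist_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

static int qlist_empty(const struct qlist_head *head)
{
    return head->next == head;
}

/* Hypothetical, trimmed-down versions of the structs in the listing. */
struct PINT_frame_s {
    void *frame;
    struct qlist_head link;
};

struct PINT_smcb {
    struct qlist_head frames;
    int frame_count;
};

/* Return the frame at 'index' (0 = first entry), or NULL if the list is
 * empty. 'i' advances on every step, and the assert fires if 'index'
 * would walk back around onto the list head (index out of range). */
void *sm_frame_at(struct PINT_smcb *smcb, int index)
{
    struct qlist_head *next;
    int i;

    assert(index >= 0);
    if (qlist_empty(&smcb->frames))
        return NULL;

    next = smcb->frames.next;
    for (i = 0; i < index; i++) {
        next = next->next;
        assert(next != &smcb->frames);
    }
    return qlist_entry(next, struct PINT_frame_s, link)->frame;
}
```

Writing the walk as a `for` loop keeps the increment next to the bound, which is exactly the piece missing from the listed `while` loop.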



All I can get from this is that the frames qlist has a single entry,
state_stack[4].  I'm not sure how it got that deep.  It's likely
some sort of memory corruption, or else we have a fairly major
undiscovered state machine (SM) bug on our hands.
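One detail that may be worth a second look: in a circular qlist, an empty list is a head that points at itself, so `next == prev` alone does not distinguish one entry from zero. The `FRAME GET ... -> frame: NULL` message in the log is only emitted from the `qlist_empty()` branch of `PINT_sm_frame`. Assuming 4-byte ints and pointers and no padding, `frames` sits 0x28 bytes into the smcb, i.e. at 0x10119bd8 + 0x28 = 0x10119c00, which is the address both links point to. If so, the list was empty while `frame_count` was 1. A cheap check that might catch this kind of corruption closer to its source is to compare `frame_count` against the real list length. The sketch below uses simplified stand-in types rather than the PVFS2 headers, and `smcb_frames_consistent` is a hypothetical helper name.

```c
#include <assert.h> /* for assert(smcb_frames_consistent(...)) at call sites */

/* Simplified stand-ins for the PVFS2 list and smcb types (assumption). */
struct qlist_head {
    struct qlist_head *next, *prev;
};

struct PINT_smcb {
    struct qlist_head frames;
    int frame_count;
};

/* Count entries in a circular list; an empty list's head points at itself. */
static int qlist_length(const struct qlist_head *head)
{
    int n = 0;
    const struct qlist_head *p;

    for (p = head->next; p != head; p = p->next)
        n++;
    return n;
}

/* Hypothetical sanity check: nonzero if frame_count matches the list. */
int smcb_frames_consistent(const struct PINT_smcb *smcb)
{
    return qlist_length(&smcb->frames) == smcb->frame_count;
}
```

Called as `assert(smcb_frames_consistent(smcb));` wherever frames are pushed or popped, this would fail on an smcb whose list is empty but whose `frame_count` is 1.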

If you can reproduce this at will, a -g build run with all
debugging enabled would be especially nice.  The debug log might
show something curious.

The other approach is to run under valgrind and cross our fingers
that it finds something interesting.

                -- Pete

(gdb) info locals
i = 0
new_list_index = 0
tmp_completion_list = {0x0 <repeats 256 times>}
sm_p = (PINT_client_sm *) 0x0
__PRETTY_FUNCTION__ = "completion_list_retrieve_completed"
(gdb) print op_id_array
$5 = (PVFS_sys_op_id *) 0xfff7d710
(gdb) print op_id_array[0]
$7 = 34


_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

