Attached is a patch that, so far, has resolved the I/O cancellation issues 
we've been seeing. 

In completion_list_retrieve_completed, a cancelled I/O operation now gets the 
base frame's user_ptr assigned to user_ptr_array (the vfs_request array 
used back in process_vfs_requests). This change stops the segfaults in 
process_vfs_requests. Then, in PINT_client_io_cancel, the references to the 
contexts come from sm_base_p instead of sm_p, which ensures that 
context_count is correct and that the context structure holds the right 
data. Without this change, the code was reading from a frame that doesn't 
contain that data.

Let me know if this looks okay. If so, can you apply it (or give me an 
okay to apply it) to head?

Thanks,
Michael

On Wed, Feb 03, 2010 at 04:38:43PM -0500, Michael Moore wrote:
> Hi Phil,
> 
> We're still seeing some issues around cancellation. One case I noticed, 
> but am finding hard to replicate, is when the sys-io state machine is in 
> the unstuff_xfer_msgpair state and has jumped to pvfs2_msgpairarray_sm. 
> For that state there will be a similar issue with a non I/O frame on the 
> stack, correct? The cases I've seen print gibberish context counts, such 
> as those below, followed by a segfault when accessing cur_ctx.
> 
> [D 15:51:00.658599] PINT_client_io_cancel id 7707
> [D 15:51:00.658639] base frame is at index: -1
> [D 15:51:00.658648] PINT_client_io_cancel: sm_p->u.io.context_count: 8958368
> [D 15:51:00.658657] PINT_client_io_cancel: iteration i: 0
> 
> #0  PINT_client_io_cancel (id=7707) 
>     at src/client/sysint/client-state-machine.c:548
> #1  0x0804baf7 in service_operation_cancellation (vfs_request=0x85227e0) 
>     at src/apps/kernel/linux/pvfs2-client-core.c:407
> #2  0x0804f311 in handle_unexp_vfs_request (vfs_request=0x85227e0) 
>     at src/apps/kernel/linux/pvfs2-client-core.c:2980
> #3  0x08050f1f in process_vfs_requests ()
>     at src/apps/kernel/linux/pvfs2-client-core.c:3180
> #4  0x080527a8 in main (argc=10, argv=0xbfa14434)
>     at src/apps/kernel/linux/pvfs2-client-core.c:3593
> 
> I notice there are jumps for io_getattr and io_datafile_size which would 
> put other frames on the stack. Should the code after the small-io check 
> just use the base frame pointer instead of sm_p? 
> 
> Thanks,
> Michael
> 
> On Wed, Jan 20, 2010 at 08:01:41AM -0600, Phil Carns wrote:
> > Great!  Thanks for testing it out.
> > 
> > -Phil
> > 
> > Michael Moore wrote:
> > > Thanks Phil, that appears to solve the problem! I tested it both against 
> > > head and orange branch and didn't see any of the infinite looping or 
> > > client segfaults. I tested it without any of the other changes so it 
> > > looks like that patch alone resolves the issue.
> > > 
> > > Michael
> > > 
> > > On Fri, Jan 15, 2010 at 03:28:54PM -0500, Phil Carns wrote:
> > >> Hi Michael,
> > >>
> > >> I just tried your test case on a clean trunk build here and was able to 
> > >> reproduce the pvfs2-client-core segfault 100% of the time on my box.
> > >>
> > >> The problem in a nutshell is that pvfs2-client-core was trying to cancel 
> > >> a small-io operation using logic that is only appropriate for a normal 
> > >> I/O operation, in turn causing some memory corruptions.
> > >>
> > >> Can you try out the fix and see if it solves the problem for you?  The 
> > >> patch is attached, or you can pull it from CVS trunk.
> > >>
> > >> You might want to try that change by itself (without the op purged 
> > >> change) first and go from there.  Some of the other issues you ran into 
> > >> may have been an after-effect from the cancel problem.
> > >>
> > >> -Phil
> _______________________________________________
> Pvfs2-developers mailing list
> [email protected]
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
diff -r --exclude CVS cvs-head-a/pvfs2/src/client/sysint/client-state-machine.c cvs-head-b/pvfs2/src/client/sysint/client-state-machine.c
156c156,177
<                 user_ptr_array[i] = (void *)sm_p->user_ptr;
---
>                 /* if this smcb has been cancelled and is a PVFS_SYS_IO
>                  * state machine, then stick the base frame's user_ptr
>                  * into the user_ptr_array instead of the standard sm_p 
>                  * user_ptr. This prevents segfaults back in 
>                  * process_vfs_requests, which expects the pointer to be a 
>                  * vfs_request.
>                  */
>                 if( smcb->op_cancelled && smcb->op == PVFS_SYS_IO )
>                 {
>                     PINT_client_sm *sm_base_p = PINT_sm_frame(smcb,
>                                                  (-(smcb->frame_count -1)));
>                     assert(sm_base_p);
>                     gossip_debug(GOSSIP_CANCEL_DEBUG, "%s: assignment of "
>                                  "PVFS_SYS_IO user_ptr from sm_base_p(%p), "
>                                  "user_ptr(%p)\n", __func__, sm_base_p, 
>                                  sm_base_p->user_ptr);
>                     user_ptr_array[i] = sm_base_p->user_ptr;
>                 }
>                 else
>                 {
>                     user_ptr_array[i] = (void *)sm_p->user_ptr;
>                 }
516a538,544
>      *
>      * sm_base_p is used below instead of sm_p since it contains the correct
>      * counters and context pointers. In the event the control block only
>      * has one frame it behaves as it did previously. If the cancellation
>      * occurs when a non-IO frame, which doesn't have the expected structure,
>      * has been pushed on the stack, it doesn't cause a segfault but leaves
>      * it on the state machine's stack.
518a547
>     assert(sm_base_p);
538c567
<     for(i = 0; i < sm_p->u.io.context_count; i++)
---
>     for(i = 0; i < sm_base_p->u.io.context_count; i++)
540c569
<         PINT_client_io_ctx *cur_ctx = &sm_p->u.io.contexts[i];
---
>         PINT_client_io_ctx *cur_ctx = &sm_base_p->u.io.contexts[i];
556c585
<             sm_p->u.io.total_cancellations_remaining++;
---
>             sm_base_p->u.io.total_cancellations_remaining++;
572c601
<             sm_p->u.io.total_cancellations_remaining++;
---
>             sm_base_p->u.io.total_cancellations_remaining++;
587c616
<             sm_p->u.io.total_cancellations_remaining++;
---
>             sm_base_p->u.io.total_cancellations_remaining++;
603c632
<             sm_p->u.io.total_cancellations_remaining++;
---
>             sm_base_p->u.io.total_cancellations_remaining++;
607,608c636,637
<                  "remaining: %d\n", sm_p,
<                  sm_p->u.io.total_cancellations_remaining);
---
>                  "remaining: %d\n", sm_base_p,
>                  sm_base_p->u.io.total_cancellations_remaining);