Hi Phil,
We're still seeing some issues around cancellation. One case I've noticed,
but am finding hard to reproduce, is when the sys-io state machine is in
the unstuff_xfer_msgpair state and has jumped to pvfs2_msgpairarray_sm.
In that state there will be a similar issue with a non-I/O frame on the
stack, correct? In the cases I've seen, a gibberish context count gets
printed, as in the log below, followed by a segfault when accessing
cur_ctx.
[D 15:51:00.658599] PINT_client_io_cancel id 7707
[D 15:51:00.658639] base frame is at index: -1
[D 15:51:00.658648] PINT_client_io_cancel: sm_p->u.io.context_count: 8958368
[D 15:51:00.658657] PINT_client_io_cancel: iteration i: 0
#0 PINT_client_io_cancel (id=7707)
at src/client/sysint/client-state-machine.c:548
#1 0x0804baf7 in service_operation_cancellation (vfs_request=0x85227e0)
at src/apps/kernel/linux/pvfs2-client-core.c:407
#2 0x0804f311 in handle_unexp_vfs_request (vfs_request=0x85227e0)
at src/apps/kernel/linux/pvfs2-client-core.c:2980
#3 0x08050f1f in process_vfs_requests ()
at src/apps/kernel/linux/pvfs2-client-core.c:3180
#4 0x080527a8 in main (argc=10, argv=0xbfa14434)
at src/apps/kernel/linux/pvfs2-client-core.c:3593
I notice there are also jumps for io_getattr and io_datafile_size, which
would put other frames on the stack. Should the code after the small-io
check just use the base frame pointer instead of sm_p?
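
A minimal sketch of the idea behind that question, assuming a simple
array-backed frame stack. Every type and function name here (struct smcb,
push_frame, base_frame, contexts_to_cancel) is a hypothetical stand-in and
not the real PVFS2 structures or API; in particular, placing the base frame
at index 0 is an assumption made only for illustration. The point is that a
nested state machine's frame sits on top of the stack, so reading I/O fields
through the current frame would interpret unrelated memory, while looking up
the base frame and checking its kind avoids that.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the real PVFS2 types. */
enum frame_kind { FRAME_IO, FRAME_MSGPAIR };

struct io_state      { int context_count; };
struct msgpair_state { void *msgarray; int count; };

struct frame {
    enum frame_kind kind;
    union {
        struct io_state      io;      /* valid only when kind == FRAME_IO */
        struct msgpair_state msgpair; /* valid only when kind == FRAME_MSGPAIR */
    } u;
};

#define MAX_FRAMES 8

struct smcb {
    struct frame *stack[MAX_FRAMES];
    int depth; /* number of frames currently on the stack */
};

/* Push a frame, as a jump into a nested state machine might do.
 * Returns 0 on success, -1 if the stack is full. */
static int push_frame(struct smcb *s, struct frame *f)
{
    if (s->depth >= MAX_FRAMES)
        return -1;
    s->stack[s->depth++] = f;
    return 0;
}

/* Topmost frame: the one the nested state machine is running with. */
static struct frame *current_frame(const struct smcb *s)
{
    return s->depth > 0 ? s->stack[s->depth - 1] : NULL;
}

/* Bottom frame: the one the original operation was posted with
 * (assumed here to live at index 0). */
static struct frame *base_frame(const struct smcb *s)
{
    return s->depth > 0 ? s->stack[0] : NULL;
}

/* Number of I/O contexts a cancel would walk, read from the base frame.
 * Returns -1 if there is no base frame or it is not an I/O frame, so the
 * caller never loops over a garbage count. */
static int contexts_to_cancel(const struct smcb *s)
{
    const struct frame *base = base_frame(s);
    if (base == NULL || base->kind != FRAME_IO)
        return -1;
    return base->u.io.context_count;
}
```

With an I/O frame pushed first and a msgpair frame pushed on top of it,
current_frame() returns the msgpair frame, while contexts_to_cancel() still
reports the count stored in the I/O frame.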
Thanks,
Michael
On Wed, Jan 20, 2010 at 08:01:41AM -0600, Phil Carns wrote:
> Great! Thanks for testing it out.
>
> -Phil
>
> Michael Moore wrote:
> > Thanks Phil, that appears to solve the problem! I tested it both against
> > head and orange branch and didn't see any of the infinite looping or
> > client segfaults. I tested it without any of the other changes so it
> > looks like that patch alone resolves the issue.
> >
> > Michael
> >
> > On Fri, Jan 15, 2010 at 03:28:54PM -0500, Phil Carns wrote:
> >> Hi Michael,
> >>
> >> I just tried your test case on a clean trunk build here and was able to
> >> reproduce the pvfs2-client-core segfault 100% of the time on my box.
> >>
> >> The problem in a nutshell is that pvfs2-client-core was trying to cancel
> >> a small-io operation using logic that is only appropriate for a normal
> >> I/O operation, in turn causing some memory corruptions.
> >>
> >> Can you try out the fix and see if it solves the problem for you? The
> >> patch is attached, or you can pull it from cvs trunk.
> >>
> >> You might want to try that change by itself (without the op purged
> >> change) first and go from there. Some of the other issues you ran into
> >> may have been an after-effect from the cancel problem.
> >>
> >> -Phil
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers