Gave the new version in trunk a test and it seems to handle the 
remount problem correctly. Thanks for the cleanup and getting it 
applied!

Michael

On Thu, Feb 11, 2010 at 06:07:23PM -0500, Phil Carns wrote:
> Thanks for the new patch and for the explanation.  I checked a modified 
> version of your patch into trunk.  Can you try it out and let me know if 
> it works on your end?
> 
> I made some changes to how client-core exits (and how pvfs2-client 
> detects it) to make things a little cleaner.  On my box the client-core 
> was segfaulting as it shut down because it used gossip after 
> sys_finalize().  Fixing that prevented pvfs2-client from restarting 
> pvfs2-client-core, though, so I added a special return code from 
> pvfs2-client-core to explicitly tell pvfs2-client to try again.
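> 
> To make that mechanism concrete, here is a minimal sketch of the 
> parent-side logic; CORE_RESTART_EXIT_CODE, monitor_client_core(), and 
> respawn_client_core() are illustrative names, not the actual trunk 
> code:
> 
> #include <sys/types.h>
> #include <sys/wait.h>
> 
> /* assumed "please restart me" exit code from pvfs2-client-core */
> #define CORE_RESTART_EXIT_CODE 42
> 
> static void respawn_client_core(void)
> {
>     /* fork()/exec() a fresh pvfs2-client-core here */
> }
> 
> static void monitor_client_core(pid_t core_pid)
> {
>     int status;
> 
>     if (waitpid(core_pid, &status, 0) < 0)
>         return;
> 
>     if (WIFEXITED(status) &&
>         WEXITSTATUS(status) == CORE_RESTART_EXIT_CODE)
>     {
>         /* client-core explicitly asked to be restarted
>          * (e.g. its remount failed) */
>         respawn_client_core();
>     }
>     else if (WIFSIGNALED(status))
>     {
>         /* client-core crashed (segfault, etc.); restart as before */
>         respawn_client_core();
>     }
>     /* a normal zero exit means shut down for good */
> }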
> 
> -Phil
> 
> Michael Moore wrote:
> > Attached is the cvs diff with the requested flags. I noticed how useless 
> > the previous patch format I used was when I was applying the cancel I/O 
> > patch :)
> > 
> > It does lead to a pvfs2-client-core restart loop if the connection to 
> > the server never comes back. However, the loop is tempered by the BMI 
> > timeout and retry counts, so the iterations should be reasonably far 
> > apart (I don't recall the defaults offhand, but a restart should only 
> > happen every couple of minutes).
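> >
> > As a rough worked example (the numbers here are assumed, not the 
> > actual defaults): with a BMI timeout of 30 seconds and 5 retries, each 
> > failed remount attempt blocks for about 30 * 5 = 150 seconds, so a new 
> > client-core would only be spawned every ~2.5 minutes.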
> > 
> > Michael
> > 
> > On Mon, Feb 08, 2010 at 01:51:16PM -0500, Phil Carns wrote:
> >> Hi Michael,
> >>
> >> Could you regenerate this patch with "diff -Naupr"  (or "cvs diff 
> >> -Naup")?  The -u in particular makes it a little easier to read/apply.
> >>
> >> I think this is the same issue as described in this open trac entry, 
> >> which would be great to knock out:
> >>
> >> https://trac.mcs.anl.gov/projects/pvfs/ticket/66
> >>
> >> I haven't traced through the code yet to look myself, but is there any 
> >> chance of the pvfs2-client-core getting stuck in a restart loop?
> >>
> >> -Phil
> >>
> >> Michael Moore wrote:
> >>> Attached is a patch against head for the issue. The comments largely
> >>> describe what's going on. If pvfs2-client-core is restarted due to a
> >>> segfault while a PVFS filesystem is already mounted, any requests will
> >>> cause the process to spin.
> >>>
> >>> The patch adds a check at the end of the process_vfs_requests
> >>> while(s_client_is_processing) loop for whether mount_complete is set
> >>> to failed. If so, pvfs2-client-core exits with a non-zero value so a
> >>> new client-core will be restarted and will mount/add the filesystem,
> >>> provided exec_remount completes successfully. If everything looks
> >>> okay, can you apply it to head?
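> >>>
> >>> Roughly, the new check looks like this (the mount state names and
> >>> exact placement are illustrative, not verbatim from the patch):
> >>>
> >>> #include <stdlib.h>
> >>>
> >>> /* illustrative stand-ins for the real client-core state */
> >>> enum mount_state { MOUNT_PENDING, MOUNT_OK, MOUNT_FAILED };
> >>> static enum mount_state mount_complete = MOUNT_PENDING;
> >>> static int s_client_is_processing = 1;
> >>>
> >>> static void process_vfs_requests(void)
> >>> {
> >>>     while (s_client_is_processing)
> >>>     {
> >>>         /* ... service VFS upcalls via PVFS_sys_testsome() ... */
> >>>
> >>>         /* new check: if the remount has been marked failed, exit
> >>>          * non-zero so pvfs2-client respawns a fresh client-core
> >>>          * that will redo the remount */
> >>>         if (mount_complete == MOUNT_FAILED)
> >>>         {
> >>>             exit(EXIT_FAILURE);
> >>>         }
> >>>     }
> >>> }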
> >>>
> >>> Thanks,
> >>> Michael
> >>>
> >>> On Mon, Feb 01, 2010 at 02:09:46PM -0500, Michael Moore wrote:
> >>>> On Mon, Feb 01, 2010 at 02:04:22PM -0500, Michael Moore wrote:
> >>>>> We recently saw some strange behavior in pvfs2-client-core when a 
> >>>>> server goes away (via segfault) and the client is unable to re-mount 
> >>>>> the filesystem. The pvfs2-client-core process takes up 100% of a core 
> >>>>> just spinning on process_vfs_requests -> PVFS_sys_testsome and 
> >>>>> subsequent calls. A full backtrace follows.
> >>>>>
> >>>>> Looking at the code in pvfs2-client-core, it seems to assume that 
> >>>>> the re-mount will always succeed (around line 3579). However, I don't 
> >>>>> know that this is the root cause of the issue. I'll continue looking, 
> >>>>> but wondered if anyone had ideas on this. The problem appears to be 
> >>>>> re-creatable by:
> >>>>> 1) cleanly mounting and using the filesystem for some I/O,
> >>>>> 2) either killing the servers or adding iptables rules on the client 
> >>>>> to reject traffic to the server (see the example below), and
> >>>>> 3) attempting I/O from the client.
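> >>>>>
> >>>>> For step 2, a rule along these lines (assuming the server listens on 
> >>>>> the default 3334/tcp; adjust the port to match your config) simulates 
> >>>>> the outage from the client side:
> >>>>>
> >>>>>   iptables -A OUTPUT -p tcp --dport 3334 -j REJECT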
> >>>> I neglected to mention that pvfs2-client-core must be killed after 
> >>>> attempting I/O to the 'failed' server, as I only saw this behavior 
> >>>> after the client-core restarts. I'm still digging into the reason the 
> >>>> client-core segfaulted after a failed I/O flow.
> >>>>
> >>>> Michael
> >>>>
> >>>>> The operation correctly fails with "connection refused", but the 
> >>>>> client then begins to spin, consuming CPU.
> >>>>>
> >>>>> (gdb) bt
> >>>>> #0  0x00511402 in __kernel_vsyscall ()
> >>>>> #1  0x001d8023 in poll () from /lib/libc.so.6
> >>>>> #2  0x0082ebed in PINT_dev_test_unexpected (incount=5, outcount=0xbf84e4f8, info_array=0x8bbbc0, max_idle_time=10) at src/io/dev/pint-dev.c:398
> >>>>> #3  0x00848f50 in PINT_thread_mgr_dev_push (max_idle_time=10) at src/io/job/thread-mgr.c:332
> >>>>> #4  0x00844caf in do_one_work_cycle_all (idle_time_ms=10) at src/io/job/job.c:5238
> >>>>> #5  0x008454a1 in job_testcontext (out_id_array_p=0xbf8515f0, inout_count_p=0xbf8521f4, returned_user_ptr_array=0xbf851df4, out_status_array_p=0xbf84e5f0, timeout_ms=10, context_id=0) at src/io/job/job.c:4273
> >>>>> #6  0x00857dba in PINT_client_state_machine_testsome (op_id_array=0xbf8522a8, op_count=0xbf8528c4, user_ptr_array=0xbf8527a8, error_code_array=0xbf8526a8, timeout_ms=10) at src/client/sysint/client-state-machine.c:756
> >>>>> #7  0x00857fb9 in PVFS_sys_testsome (op_id_array=0xbf8522a8, op_count=0xbf8528c4, user_ptr_array=0xbf8527a8, error_code_array=0xbf8526a8, timeout_ms=10) at src/client/sysint/client-state-machine.c:971
> >>>>> #8  0x08050cd6 in process_vfs_requests () at src/apps/kernel/linux/pvfs2-client-core.c:3119
> >>>>> #9  0x08052658 in main (argc=10, argv=0xbf852a74) at src/apps/kernel/linux/pvfs2-client-core.c:3579
> >>>>>
> >>>>> Let me know if more information is needed, and thanks for the input!
> >>>>>
> >>>>> Michael
> 
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
