Gave the new version in trunk a test and it seems to handle the remount problem correctly. Thanks for the cleanup and getting it applied!
Michael

On Thu, Feb 11, 2010 at 06:07:23PM -0500, Phil Carns wrote:
> Thanks for the new patch and for the explanation. I checked a modified
> version of your patch into trunk. Can you try it out and let me know if
> it works on your end?
>
> I made some changes to how client-core exits (and how pvfs2-client
> detects it) to make things a little cleaner. On my box the client-core
> was segfaulting as it shut down because it used gossip after
> sys_finalize(). Fixing that prevented the pvfs2-client from restarting
> pvfs2-client-core, though. I added a special return code from
> pvfs2-client-core instead to explicitly tell pvfs2-client to try again.
>
> -Phil
>
> Michael Moore wrote:
> > Attached is the cvs diff with the requested flags. I noticed how
> > useless the previous patch format I used was when I was applying the
> > cancel I/O patch :)
> >
> > It does lead to a pvfs2-client-core restart loop if the connection to
> > the server never comes back. However, the loop will be tempered by
> > the BMI timeout and retry counts, so it should be a reasonably long
> > loop (I don't recall the defaults offhand, but it should only be
> > every couple of minutes).
> >
> > Michael
> >
> > On Mon, Feb 08, 2010 at 01:51:16PM -0500, Phil Carns wrote:
> >> Hi Michael,
> >>
> >> Could you regenerate this patch with "diff -Naupr" (or "cvs diff
> >> -Naup")? The -u in particular makes it a little easier to read and
> >> apply.
> >>
> >> I think this is the same issue as described in this open trac entry,
> >> which it would be great to knock out:
> >>
> >> https://trac.mcs.anl.gov/projects/pvfs/ticket/66
> >>
> >> I haven't traced through the code yet to look myself, but is there
> >> any chance of the pvfs2-client-core getting stuck in a restart loop?
> >>
> >> -Phil
> >>
> >> Michael Moore wrote:
> >>> Attached is a patch against head for the issue. The comments
> >>> largely describe what's going on: if pvfs2-client-core is restarted
> >>> due to a segfault with a previously mounted PVFS filesystem, any
> >>> requests will cause the process to spin.
> >>>
> >>> The patch adds a check at the end of the process_vfs_request
> >>> while(s_client_is_processing) loop to see whether mount_complete is
> >>> set to failed. If so, it exits pvfs2-client-core with a non-zero
> >>> value so that a new client-core will get restarted and will
> >>> mount/add the filesystem if exec_remount completes successfully. If
> >>> everything looks okay, can you apply it to head?
> >>>
> >>> Thanks,
> >>> Michael
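
A minimal sketch of the shape of that check, for reference. The
identifiers below (remount_complete, REMOUNT_FAILED, the exit value)
are illustrative assumptions, not the actual PVFS2 names:

    /* Minimal sketch, not the actual PVFS2 source: the names
       remount_complete, REMOUNT_FAILED, and the exit value are
       illustrative assumptions. */
    #include <stdlib.h>

    #define CLIENT_CORE_REMOUNT_FAILED_EXIT 201   /* hypothetical value */

    enum remount_state { REMOUNT_PENDING, REMOUNT_OK, REMOUNT_FAILED };

    static int s_client_is_processing = 1;
    static enum remount_state remount_complete = REMOUNT_PENDING;

    /* Stand-in for one pass of the real loop body: PVFS_sys_testsome(),
       dispatching completed operations, and so on. */
    static void handle_one_vfs_request(void)
    {
    }

    static void process_vfs_requests(void)
    {
        while (s_client_is_processing)
        {
            handle_one_vfs_request();

            /* If the remount of a previously mounted filesystem has
               failed, spinning on PVFS_sys_testsome() can never make
               progress.  Exit non-zero so a new client-core gets
               restarted and retries the remount. */
            if (remount_complete == REMOUNT_FAILED)
            {
                exit(CLIENT_CORE_REMOUNT_FAILED_EXIT);
            }
        }
    }

    int main(void)
    {
        /* Daemon-style loop; runs until shutdown or a failed remount. */
        process_vfs_requests();
        return 0;
    }
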
> >>> On Mon, Feb 01, 2010 at 02:09:46PM -0500, Michael Moore wrote:
> >>>> On Mon, Feb 01, 2010 at 02:04:22PM -0500, Michael Moore wrote:
> >>>>> We recently saw some strange behavior in pvfs2-client-core when a
> >>>>> server goes away (via segfault) and the client is unable to
> >>>>> re-mount the filesystem. The pvfs2-client-core process takes up
> >>>>> 100% of a core just spinning on process_vfs_request ->
> >>>>> PVFS_sys_testsome and subsequent calls. A full backtrace follows.
> >>>>>
> >>>>> Looking at the code in pvfs2-client-core, it seems to assume that
> >>>>> the re-mount will always succeed (around line 3579). However, I
> >>>>> don't know that this is the root cause of the issue. I'll continue
> >>>>> looking, but wondered if anyone had ideas on this.
> >>>>>
> >>>>> This appears to be re-creatable by:
> >>>>> 1) cleanly mounting and using the filesystem for some I/O,
> >>>>> 2) either killing the servers or adding iptables rules to the
> >>>>>    client to reject traffic to the server,
> >>>>> 3) attempting I/O from the client.
> >>>> I neglected to mention that pvfs2-client-core must be killed after
> >>>> attempting I/O traffic to the 'failed' server, as I only saw this
> >>>> behavior after the client core restarts. I'm still digging into the
> >>>> reason the client core segfaulted after a failed I/O flow.
> >>>>
> >>>> Michael
> >>>>
> >>>>> The operation correctly dies with connection refused, but the
> >>>>> client begins to spin, taking up CPU.
> >>>>>
> >>>>> (gdb) bt
> >>>>> #0  0x00511402 in __kernel_vsyscall ()
> >>>>> #1  0x001d8023 in poll () from /lib/libc.so.6
> >>>>> #2  0x0082ebed in PINT_dev_test_unexpected (incount=5,
> >>>>>     outcount=0xbf84e4f8, info_array=0x8bbbc0, max_idle_time=10)
> >>>>>     at src/io/dev/pint-dev.c:398
> >>>>> #3  0x00848f50 in PINT_thread_mgr_dev_push (max_idle_time=10)
> >>>>>     at src/io/job/thread-mgr.c:332
> >>>>> #4  0x00844caf in do_one_work_cycle_all (idle_time_ms=10)
> >>>>>     at src/io/job/job.c:5238
> >>>>> #5  0x008454a1 in job_testcontext (out_id_array_p=0xbf8515f0,
> >>>>>     inout_count_p=0xbf8521f4, returned_user_ptr_array=0xbf851df4,
> >>>>>     out_status_array_p=0xbf84e5f0, timeout_ms=10, context_id=0)
> >>>>>     at src/io/job/job.c:4273
> >>>>> #6  0x00857dba in PINT_client_state_machine_testsome
> >>>>>     (op_id_array=0xbf8522a8, op_count=0xbf8528c4,
> >>>>>     user_ptr_array=0xbf8527a8, error_code_array=0xbf8526a8,
> >>>>>     timeout_ms=10) at src/client/sysint/client-state-machine.c:756
> >>>>> #7  0x00857fb9 in PVFS_sys_testsome (op_id_array=0xbf8522a8,
> >>>>>     op_count=0xbf8528c4, user_ptr_array=0xbf8527a8,
> >>>>>     error_code_array=0xbf8526a8, timeout_ms=10)
> >>>>>     at src/client/sysint/client-state-machine.c:971
> >>>>> #8  0x08050cd6 in process_vfs_requests ()
> >>>>>     at src/apps/kernel/linux/pvfs2-client-core.c:3119
> >>>>> #9  0x08052658 in main (argc=10, argv=0xbf852a74)
> >>>>>     at src/apps/kernel/linux/pvfs2-client-core.c:3579
> >>>>>
> >>>>> Let me know if more information is needed, thanks for the input!
> >>>>>
> >>>>> Michael
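
For completeness, the parent-side handling Phil describes at the top of
the thread (pvfs2-client respawning pvfs2-client-core only on a special
return code) might look roughly like the following sketch. All names,
the binary lookup, the exit value, and the back-off are assumptions, not
the actual implementation:

    /* Minimal sketch, not the actual implementation: pvfs2-client
       respawns pvfs2-client-core only when the child exits with the
       special "restart me" status. */
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define CLIENT_CORE_REMOUNT_FAILED_EXIT 201   /* must match the child */

    static pid_t spawn_client_core(void)
    {
        pid_t pid = fork();
        if (pid == 0)
        {
            /* Binary name and argument list are illustrative. */
            execlp("pvfs2-client-core", "pvfs2-client-core", (char *)NULL);
            _exit(127);   /* exec failed */
        }
        return pid;
    }

    int main(void)
    {
        for (;;)
        {
            pid_t pid = spawn_client_core();
            int status = 0;

            if (pid < 0 || waitpid(pid, &status, 0) < 0)
                return 1;

            if (WIFEXITED(status) &&
                WEXITSTATUS(status) == CLIENT_CORE_REMOUNT_FAILED_EXIT)
            {
                /* Remount failed; pause briefly and try again.  In
                   practice the pacing of this loop comes from the BMI
                   timeout and retry counts mentioned earlier in the
                   thread. */
                sleep(5);
                continue;
            }

            break;   /* clean shutdown, or an exit handled elsewhere */
        }
        return 0;
    }
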
