Thanks for the new patch and for the explanation. I checked a modified
version of your patch into trunk. Can you try it out and let me know if
it works on your end?
I made some changes to how client-core exits (and how pvfs2-client
detects it) to make things a little cleaner. On my box the client-core
was segfaulting as it shut down because it used gossip after
sys_finalize(). Fixing that prevented pvfs2-client from restarting
pvfs2-client-core, though, so I added a special return code to
pvfs2-client-core instead to explicitly tell pvfs2-client to try again.
-Phil
Michael Moore wrote:
Attached is the cvs diff with the requested flags. I noticed how useless
the previous patch format I used was when I was applying the cancel I/O
patch :)
It does lead to a pvfs2-client-core restart loop if the connection to
the server never comes back. However, the loop is tempered by the BMI
timeout and retry counts, so the iterations should be spaced reasonably
far apart (I don't recall the defaults offhand, but it should only
retry every couple of minutes).
Michael
On Mon, Feb 08, 2010 at 01:51:16PM -0500, Phil Carns wrote:
Hi Michael,
Could you regenerate this patch with "diff -Naupr" (or "cvs diff
-Naup")? The -u in particular makes it a little easier to read/apply.
I think this is the same issue as described in this open trac entry,
which would be great to knock out:
https://trac.mcs.anl.gov/projects/pvfs/ticket/66
I haven't traced through the code yet to look myself, but is there any
chance of the pvfs2-client-core getting stuck in a restart loop?
-Phil
Michael Moore wrote:
Attached is a patch against head for the issue. The comments largely
describe what's going on. If pvfs2-client-core is re-started due to a
segfault with a previously mounted PVFS filesystem any requests will
cause the process to spin.
The patch adds a check at the end of process_vfs_request's
while(s_client_is_processing) loop to see whether mount_complete is set
to failed. If so, it exits pvfs2-client-core with a non-zero value so
that a new client-core gets restarted and mounts/adds the filesystem if
exec_remount completes successfully. If everything looks okay, can you
apply it to head?
Thanks,
Michael
On Mon, Feb 01, 2010 at 02:09:46PM -0500, Michael Moore wrote:
On Mon, Feb 01, 2010 at 02:04:22PM -0500, Michael Moore wrote:
We recently saw some strange behavior in pvfs2-client-core when a server goes away
(via segfault) and the client is unable to re-mount the filesystem. The
pvfs2-client-core process takes up 100% of a core just spinning on
process_vfs_request -> PVFS_sys_testsome and subsequent calls. Full backtrace
follows.
In looking at the code in pvfs2-client-core it seems to assume that the re-mount
will always succeed (around line 3579). However, I don't know that it's the root
cause of the issue. I'll continue looking but wondered if anyone had ideas on this.
This appears to be re-creatable by:
1) cleanly mounting and using the filesystem for some I/O
2) either killing the servers or adding iptables rules to the client to reject
traffic to the server,
3) Attempting I/O from the client
I neglected to mention that pvfs2-client-core must be killed after attempting
I/O to the 'failed' server, as I only saw this behavior after the client
core restarts. I'm still digging into the reason the client core segfaulted after a
failed I/O flow.
Michael
The operation correctly dies with connection refused but the client begins to spin
taking up CPU.
(gdb) bt
#0 0x00511402 in __kernel_vsyscall ()
#1 0x001d8023 in poll () from /lib/libc.so.6
#2 0x0082ebed in PINT_dev_test_unexpected (incount=5, outcount=0xbf84e4f8,
info_array=0x8bbbc0, max_idle_time=10) at src/io/dev/pint-dev.c:398
#3 0x00848f50 in PINT_thread_mgr_dev_push (max_idle_time=10) at
src/io/job/thread-mgr.c:332
#4 0x00844caf in do_one_work_cycle_all (idle_time_ms=10) at
src/io/job/job.c:5238
#5 0x008454a1 in job_testcontext (out_id_array_p=0xbf8515f0,
inout_count_p=0xbf8521f4, returned_user_ptr_array=0xbf851df4,
out_status_array_p=0xbf84e5f0, timeout_ms=10, context_id=0) at
src/io/job/job.c:4273
#6 0x00857dba in PINT_client_state_machine_testsome (op_id_array=0xbf8522a8,
op_count=0xbf8528c4, user_ptr_array=0xbf8527a8, error_code_array=0xbf8526a8,
timeout_ms=10) at src/client/sysint/client-state-machine.c:756
#7 0x00857fb9 in PVFS_sys_testsome (op_id_array=0xbf8522a8,
op_count=0xbf8528c4, user_ptr_array=0xbf8527a8, error_code_array=0xbf8526a8,
timeout_ms=10) at src/client/sysint/client-state-machine.c:971
#8 0x08050cd6 in process_vfs_requests () at
src/apps/kernel/linux/pvfs2-client-core.c:3119
#9 0x08052658 in main (argc=10, argv=0xbf852a74) at
src/apps/kernel/linux/pvfs2-client-core.c:3579
Let me know if more information is needed, thanks for the input!
Michael
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers