Attached is a patch against head for this issue. The comments in the patch
largely describe what's going on: if pvfs2-client-core is restarted (due to
a segfault) with a previously mounted PVFS filesystem, any subsequent
requests cause the process to spin.

The patch adds a check at the end of the while(s_client_is_processing) loop
in process_vfs_requests() to see whether remount_complete is set to
REMOUNT_FAILED. If so, pvfs2-client-core exits with a non-zero value so that
a new client-core gets restarted and can mount/add the filesystem once
exec_remount completes successfully. If everything looks okay, can you
apply it to head?
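
In case it helps review, here is roughly how the pieces fit together with
the patch applied. This is a paraphrase of the diff below rather than the
actual function bodies, so the elided loop body and the exact placement of
the check in main() are approximations:

    /* paraphrase of process_vfs_requests() with the patch applied */
    static int process_vfs_requests(void)
    {
        while (s_client_is_processing)
        {
            /* ... pull unexpected requests from the kernel device and
             * drive the client state machines (PVFS_sys_testsome) ... */

            /* new: if the remount thread reported failure, stop handling
             * requests and let main() exit */
            if (remount_complete == REMOUNT_FAILED)
            {
                return -PVFS_EAGAIN;
            }
        }
        return 0;
    }

    /* near the end of main(): a non-zero exit makes the parent
     * pvfs2-client start a fresh client-core, which retries the remount */
    if (remount_complete != REMOUNT_COMPLETED)
    {
        gossip_err("exec_remount failed\n");
        return 1;
    }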

Thanks,
Michael

On Mon, Feb 01, 2010 at 02:09:46PM -0500, Michael Moore wrote:
> On Mon, Feb 01, 2010 at 02:04:22PM -0500, Michael Moore wrote:
> > We recently saw some strange behavior in pvfs2-client-core when a server
> > goes away (via segfault) and the client is unable to re-mount the
> > filesystem. The pvfs2-client-core process takes up 100% of a core just
> > spinning on process_vfs_requests -> PVFS_sys_testsome and subsequent
> > calls. Full backtrace follows.
> > 
> > In looking at the code in pvfs2-client-core, it seems to assume that the
> > re-mount will always succeed (around line 3579). However, I don't know
> > that this is the root cause of the issue. I'll continue looking but
> > wondered if anyone had ideas on this.
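> > 
> > For example, the remount thread appears to mark the remount as complete
> > whether or not PINT_dev_remount() succeeds; roughly (paraphrasing, not
> > verbatim):
> > 
> >     ret = PINT_dev_remount();
> >     if (ret)
> >     {
> >         /* the failure is logged here, but nothing records it */
> >     }
> > 
> >     /* set unconditionally, even when PINT_dev_remount() failed */
> >     remount_complete = 1;
> > 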
> > This appears to be re-creatable by:
> > 1) cleanly mounting and using the filesystem for some I/O,
> > 2) either killing the servers or adding iptables rules on the client to
> > reject traffic to the server,
> > 3) attempting I/O from the client.
> 
> I neglected to mention that pvfs2-client-core must be killed after
> attempting I/O to the 'failed' server, as I only saw this behavior after
> the client-core restarts. I'm still digging into the reason the
> client-core segfaulted after a failed I/O flow.
> 
> Michael
> 
> > 
> > The operation correctly dies with connection refused, but the client
> > begins to spin, taking up CPU.
> > 
> > (gdb) bt
> > #0  0x00511402 in __kernel_vsyscall ()
> > #1  0x001d8023 in poll () from /lib/libc.so.6
> > #2  0x0082ebed in PINT_dev_test_unexpected (incount=5, outcount=0xbf84e4f8, 
> > info_array=0x8bbbc0, max_idle_time=10) at src/io/dev/pint-dev.c:398
> > #3  0x00848f50 in PINT_thread_mgr_dev_push (max_idle_time=10) at 
> > src/io/job/thread-mgr.c:332
> > #4  0x00844caf in do_one_work_cycle_all (idle_time_ms=10) at 
> > src/io/job/job.c:5238
> > #5  0x008454a1 in job_testcontext (out_id_array_p=0xbf8515f0, 
> > inout_count_p=0xbf8521f4, returned_user_ptr_array=0xbf851df4, 
> > out_status_array_p=0xbf84e5f0, timeout_ms=10, context_id=0) at 
> > src/io/job/job.c:4273
> > #6  0x00857dba in PINT_client_state_machine_testsome 
> > (op_id_array=0xbf8522a8, op_count=0xbf8528c4, user_ptr_array=0xbf8527a8, 
> > error_code_array=0xbf8526a8, timeout_ms=10) at 
> > src/client/sysint/client-state-machine.c:756
> > #7  0x00857fb9 in PVFS_sys_testsome (op_id_array=0xbf8522a8, 
> > op_count=0xbf8528c4, user_ptr_array=0xbf8527a8, 
> > error_code_array=0xbf8526a8, timeout_ms=10) at 
> > src/client/sysint/client-state-machine.c:971
> > #8  0x08050cd6 in process_vfs_requests () at 
> > src/apps/kernel/linux/pvfs2-client-core.c:3119
> > #9  0x08052658 in main (argc=10, argv=0xbf852a74) at 
> > src/apps/kernel/linux/pvfs2-client-core.c:3579
> > 
> > Let me know if more information is needed; thanks for the input!
> > 
> > Michael
diff -r --exclude CVS cvs-head-a/pvfs2/src/apps/kernel/linux/pvfs2-client-core.c cvs-head-b/pvfs2/src/apps/kernel/linux/pvfs2-client-core.c
132a133,135
> #define REMOUNT_NOTCOMPLETED    0
> #define REMOUNT_COMPLETED       1
> #define REMOUNT_FAILED          2
135c138,139
< static int remount_complete = 0;
---
> static int remount_complete = REMOUNT_NOTCOMPLETED;
> 
504a509,510
> 
>     /* if PINT_dev_remount fails set remount_complete appropriately */
507a514,518
>         remount_complete = REMOUNT_FAILED;
>     }
>     else
>     {
>         remount_complete = REMOUNT_COMPLETED;
509,510d519
< 
<     remount_complete = 1;
2845c2854
<     if (!remount_complete &&
---
>     if (remount_complete == REMOUNT_NOTCOMPLETED &&
3125a3135
> 
3271a3282,3304
> 
>         /* The status of the remount thread needs to be checked in the event
>          * the remount fails on client-core startup. If this is the initial
>          * startup then any mount requests will fail as expected and the
>          * client-core will behave normally. However, if a mount was
>          * previously successful (in a previous client-core incarnation),
>          * client-core doesn't check if the remount succeeded before
>          * handling the mount request and fs_add, and any subsequent requests
>          * cause this thread to spin around PINT_dev_test_unexpected.
>          *
>          * With the current structure of process_vfs_requests, creating the
>          * remount thread before entering the while loop, it seems exiting
>          * client-core on a failed remount attempt is the most straightforward
>          * way to handle this case. Exiting will cause the parent to kick off
>          * another client-core and retry the remount until it succeeds.
>          */
>         if( remount_complete == REMOUNT_FAILED )
>         {
>             gossip_debug(GOSSIP_CLIENTCORE_DEBUG,
>                          "%s: remount not completed successfully, no longer "
>                          "handling requests.\n", __func__);
>             return -PVFS_EAGAIN; 
>         }
3586c3619
<     if (remount_complete)
---
>     if (remount_complete == REMOUNT_COMPLETED)
3621a3655,3660
>     /* if the remount failed, tell the parent it's something we did wrong. */
>     if( remount_complete != REMOUNT_COMPLETED )
>     {
>         gossip_err("exec_remount failed\n");
>         return 1;
>     }