Re: [Pvfs2-developers] server daemon hung

Sam Lang Wed, 20 Oct 2010 13:30:26 -0700

On Oct 20, 2010, at 2:58 PM, Bart Taylor wrote:

> Thanks so much for the quick patches! I have applied both and will grind on 
> them for a while.. 
> 
> It sounds like the first took care of catching an error code that slipped 
> through and the second should make sure an operation still exists and cleans 
> up a little better. Will the tracking of operations you added prevent those 
> I/O operations from backing up even if there are timeouts?


I think the I/O operations were eventually completing anyway, just not fast 
enough before the timeout was reached.  You might want to consider increasing 
your timeouts on this hardware -- it looks like the values you have are being 
hit frequently under heavy load.

> 
> I will look into that client error and post separately if it becomes a 
> problem.
> 
> The backtrace showed me 35 threads running. All of them were sitting in 
> pthread_cond_timedwait save three which were in __lll_lock_wait.

Ok, sounds like you sent the right backtrace.  Those other threads are I/O 
threads just waiting for something to do.
-sam

> 
> Thanks again,
> Bart.
> 
> 
> 
> 
> On Wed, Oct 20, 2010 at 9:58 AM, Sam Lang <[email protected]> wrote:
> 
> I think the reason the operation id appears invalid is simply because 
> sometime the operation completes before the cancel path gets a chance to 
> cancel it.  Attached is another patch (should be applied after the previous 
> one), which does some better checking of whether or not the operation still 
> exists (using id_gen_safe_register instead of id_gen_fast_register), and 
> keeps the operation pointer around until it really can be removed by the call 
> to test_all.  This should prevent unsafe accesses of already freed operations 
> and perform the correct logic should the operation get completed before the 
> cancel is called.
> 
> -sam
> 
> 
> 
> 
> 
> On Oct 20, 2010, at 10:19 AM, Sam Lang wrote:
> 
> >
> > Hi Bart,
> >
> > Attached is a patch that should fix the segv.  It looks like there might be 
> > multiple bugs causing this problem.  Working from the top down:
> >
> > 1.  I/O operations at the server are timing out.  This could just be due to 
> > server load or timeout values that are too small depending on the storage 
> > hardware involved.
> > 2.  The server is trying to cancel I/O operations that have timed out, and 
> > is following the cancel path down into the pint-mgmt code 
> > (PINT_manager_cancel).  The operation id given to that function is invalid, 
> > so PINT_manager_find_worker (called by PINT_manager_cancel) can't find the 
> > associated component to continue the cancel path, and returns an error code.
> > 3.  PINT_manager_cancel fails to check the error code, so the normal code 
> > path is followed and the segfault occurs by dereferencing the worker 
> > pointer.  The attached patch should fix this segfault, but won't fix the 
> > issue that the operation id appears to be invalid, so the I/O operations 
> > won't get cancelled.  The server will continue to run, but if you keep 
> > getting timeouts you will accumulate I/O operations at that server.
> > 4.  Without the patch, the segv causes the PVFS server signal handler to 
> > get invoked, which tries to cleanup and shutdown the server.  While trying 
> > to cancel the bmi unexpected messages, it tries to lock the job 
> > completion_mutex, which must have been locked already by the trove 
> > completion thread or by the cancel path described above.  I think that once 
> > the segfault is fixed, the mutex will eventually get unlocked, and you will 
> > be able to signal the server to shutdown properly.
> > 5.  Right now each kill signal is getting caught by the signal handler, 
> > which eventually tries to lock that same completion_mutex.
> >
> > I'm not sure why you see those device error messages on the client, but I 
> > think that's a separate problem.
> >
> > Are you directing the pvfs2-client log output to the system log?  If not, 
> > you might check the pvfs2-client log file.
> >
> > There should be multiple threads running on the server.  In gdb you can do 
> > 'thread apply all bt' to get the backtrace of all running threads.
> >
> > -sam
> >
> > <pint-mgmt-segv-fix.patch>
> >
> > On Oct 20, 2010, at 9:22 AM, Bart Taylor wrote:
> >
> >>
> >> Hey guys,
> >>
> >> We have run into an issue a couple of times now were one of our 2.8 server 
> >> daemons will hang and cannot be killed. This happened again last night and 
> >> I captured some info from it this time. This particular case has 4 server 
> >> nodes running 2.6.18-194 kernels. About the time a lot of access started, 
> >> there were some client timeout messages and several of these in dmesg:
> >>
> >> *****************************************************
> >> PVFS2 Device Error:  You cannot open the device file
> >> /dev/pvfs2-req more than once.  Please make sure that
> >> there are no instances of a program using this device
> >> currently running. (You must verify this!)
> >> For example, you can use the lsof program as follows:
> >> 'lsof | grep pvfs2-req' (run this as root)
> >>  open_access_count = 1
> >> *****************************************************
> >>
> >> A pvfs2-ping hung on the second server, and I discovered this in syslog on 
> >> that server. The messages begin when the server was started yesterday and 
> >> end when the errors stopped.
> >>
> >> Oct 19 14:39:52 server2 PVFS2: [E]  PVFS2 server got signal 15 
> >> (server_status_flag: 507903)
> >> Oct 19 14:39:53 server2 kernel: kjournald starting.  Commit interval 5 
> >> seconds
> >> Oct 19 14:39:53 server2 kernel: EXT3 FS on sdb, internal journal
> >> Oct 19 14:39:53 server2 kernel: EXT3-fs: mounted filesystem with ordered 
> >> data mode.
> >> Oct 19 14:39:58 server2 PVFS2: [D] PVFS2 Server version 2.8.2 starting.
> >> Oct 19 17:32:57 server2 PVFS2: [E] job_time_mgr_expire: job time out: 
> >> cancelling flow operation, job_id: 121080.
> >> Oct 19 17:32:57 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel 
> >> called on 0x13f52d20
> >> Oct 19 17:32:57 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
> >> Oct 19 17:32:57 server2 PVFS2: [E] handle_io_error: flow proto error 
> >> cleanup started on 0x13f52d20: Operation cancelled (possibly due to 
> >> timeout)
> >> Oct 19 17:32:59 server2 PVFS2: [E] WARNING: Trove thread mgr cancel 
> >> failed, proceeding anyway.
> >> Oct 19 17:32:59 server2 last message repeated 3 times
> >> Oct 19 17:32:59 server2 PVFS2: [E] handle_io_error: flow proto 0x13f52d20 
> >> canceled 4 operations, will clean up.
> >> Oct 19 17:32:59 server2 PVFS2: [E] job_time_mgr_expire: job time out: 
> >> cancelling flow operation, job_id: 121085.
> >> Oct 19 17:32:59 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel 
> >> called on 0x13f3c750
> >> Oct 19 17:32:59 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
> >> Oct 19 17:32:59 server2 PVFS2: [E] handle_io_error: flow proto error 
> >> cleanup started on 0x13f3c750: Operation cancelled (possibly due to 
> >> timeout)
> >> Oct 19 17:32:59 server2 PVFS2: [E] WARNING: Trove thread mgr cancel 
> >> failed, proceeding anyway.
> >> Oct 19 17:32:59 server2 PVFS2: [E] trove_read_callback_fn: I/O error 
> >> occurred
> >> Oct 19 17:33:00 server2 PVFS2: [E] WARNING: Trove thread mgr cancel 
> >> failed, proceeding anyway.
> >> Oct 19 17:33:02 server2 PVFS2: [E] trove_read_callback_fn: I/O error 
> >> occurred
> >> Oct 19 17:33:03 server2 PVFS2: [E] WARNING: Trove thread mgr cancel 
> >> failed, proceeding anyway.
> >> Oct 19 17:33:03 server2 PVFS2: [E] WARNING: Trove thread mgr cancel 
> >> failed, proceeding anyway.
> >> Oct 19 17:33:04 server2 PVFS2: [E] handle_io_error: flow proto 0x13f3c750 
> >> canceled 4 operations, will clean up.
> >> Oct 19 17:33:06 server2 PVFS2: [E] trove_read_callback_fn: I/O error 
> >> occurred
> >> Oct 19 17:33:07 server2 PVFS2: [E] job_time_mgr_expire: job time out: 
> >> cancelling flow operation, job_id: 121979.
> >> Oct 19 17:33:07 server2 PVFS2: [E] trove_read_callback_fn: I/O error 
> >> occurred
> >> Oct 19 17:33:10 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel 
> >> called on 0x2aaaac1d80e0
> >> Oct 19 17:33:11 server2 PVFS2: [E] handle_io_error: flow proto 0x13f52d20 
> >> error cleanup finished: Operation cancelled (possibly due to timeout)
> >> Oct 19 17:33:11 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
> >> Oct 19 17:33:11 server2 PVFS2: [E] handle_io_error: flow proto error 
> >> cleanup started on 0x2aaaac1d80e0: Operation cancelled (possibly due to 
> >> timeout)
> >> Oct 19 17:33:11 server2 PVFS2: [E] handle_io_error: flow proto 
> >> 0x2aaaac1d80e0 canceled 1 operations, will clean up.
> >> Oct 19 17:33:12 server2 PVFS2: [E] job_time_mgr_expire: job time out: 
> >> cancelling flow operation, job_id: 122772.
> >> Oct 19 17:33:12 server2 PVFS2: [E] trove_read_callback_fn: I/O error 
> >> occurred
> >> Oct 19 17:33:12 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel 
> >> called on 0x2aaaac4b2800
> >> Oct 19 17:33:12 server2 PVFS2: [E] handle_io_error: flow proto 
> >> 0x2aaaac1d80e0 error cleanup finished: Operation cancelled (possibly due 
> >> to timeout)
> >> Oct 19 17:33:14 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
> >> Oct 19 17:33:15 server2 PVFS2: [E] handle_io_error: flow proto error 
> >> cleanup started on 0x2aaaac4b2800: Operation cancelled (possibly due to 
> >> timeout)
> >> Oct 19 17:33:15 server2 PVFS2: [E] WARNING: Trove thread mgr cancel 
> >> failed, proceeding anyway.
> >> Oct 19 17:33:17 server2 last message repeated 3 times
> >> Oct 19 17:33:17 server2 PVFS2: [E] handle_io_error: flow proto 
> >> 0x2aaaac4b2800 canceled 4 operations, will clean up.
> >> Oct 19 17:33:17 server2 PVFS2: [E] job_time_mgr_expire: job time out: 
> >> cancelling flow operation, job_id: 122777.
> >> Oct 19 17:33:17 server2 PVFS2: [E] trove_read_callback_fn: I/O error 
> >> occurred
> >> Oct 19 17:33:17 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel 
> >> called on 0x2aaaac223d90
> >> Oct 19 17:33:17 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
> >> Oct 19 17:33:17 server2 PVFS2: [E] handle_io_error: flow proto error 
> >> cleanup started on 0x2aaaac223d90: Operation cancelled (possibly due to 
> >> timeout)
> >> Oct 19 17:33:17 server2 PVFS2: [E] WARNING: Trove thread mgr cancel 
> >> failed, proceeding anyway.
> >> Oct 19 17:33:18 server2 PVFS2: [E] WARNING: Trove thread mgr cancel 
> >> failed, proceeding anyway.
> >> Oct 19 17:33:18 server2 PVFS2: [E] handle_io_error: flow proto 
> >> 0x2aaaac223d90 canceled 2 operations, will clean up.
> >> Oct 19 17:33:18 server2 PVFS2: [E] trove_read_callback_fn: I/O error 
> >> occurred
> >> Oct 19 17:33:18 server2 last message repeated 5 times
> >> Oct 19 17:33:18 server2 PVFS2: [E] handle_io_error: flow proto 0x13f3c750 
> >> error cleanup finished: Operation cancelled (possibly due to timeout)
> >> Oct 19 17:33:18 server2 PVFS2: [E] trove_read_callback_fn: I/O error 
> >> occurred
> >> Oct 19 17:33:19 server2 PVFS2: [E] handle_io_error: flow proto 
> >> 0x2aaaac4b2800 error cleanup finished: Operation cancelled (possibly due 
> >> to timeout)
> >> Oct 19 17:33:19 server2 PVFS2: [E] trove_read_callback_fn: I/O error 
> >> occurred
> >> Oct 19 17:33:19 server2 PVFS2: [E] job_time_mgr_expire: job time out: 
> >> cancelling flow operation, job_id: 123566.
> >> Oct 19 17:33:20 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel 
> >> called on 0x2aaaac1bacf0
> >> Oct 19 17:33:20 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
> >> Oct 19 17:33:20 server2 PVFS2: [E] handle_io_error: flow proto error 
> >> cleanup started on 0x2aaaac1bacf0: Operation cancelled (possibly due to 
> >> timeout)
> >> Oct 19 17:33:20 server2 PVFS2: [E] PINT_manager_find_worker: operation 
> >> posted with a queue id (30) that isn't held by this manager
> >>
> >>
> >> I tried killing the server to get it restarted, but it cannot be killed. I 
> >> attached GDB and found this:
> >>
> >>
> >> Program received signal SIGINT, Interrupt.
> >> 0x0000003d70a0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
> >> (gdb) where
> >> #0  0x0000003d70a0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
> >> #1  0x0000003d70a08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
> >> #2  0x0000003d70a08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
> >> #3  0x0000000000452579 in gen_posix_mutex_lock (mut=0x6da360) at 
> >> ../pvfs2_src/src/common/gen-locks/gen-locks.c:45
> >> #4  0x0000000000431e67 in job_bmi_unexp_cancel (id=123372) at 
> >> ../pvfs2_src/src/io/job/job.c:811
> >> #5  0x0000000000413b06 in server_purge_unexpected_recv_machines () at 
> >> ../pvfs2_src/src/server/pvfs2-server.c:1879
> >> #6  0x000000000041340c in server_sig_handler (sig=15) at 
> >> ../pvfs2_src/src/server/pvfs2-server.c:1652
> >> #7  <signal handler called>
> >> #8  0x0000003d70a0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
> >> #9  0x0000003d70a08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
> >> #10 0x0000003d70a08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
> >> #11 0x0000000000452579 in gen_posix_mutex_lock (mut=0x6da360) at 
> >> ../pvfs2_src/src/common/gen-locks/gen-locks.c:45
> >> #12 0x0000000000431e67 in job_bmi_unexp_cancel (id=123372) at 
> >> ../pvfs2_src/src/io/job/job.c:811
> >> #13 0x0000000000413b06 in server_purge_unexpected_recv_machines () at 
> >> ../pvfs2_src/src/server/pvfs2-server.c:1879
> >> #14 0x000000000041340c in server_sig_handler (sig=11) at 
> >> ../pvfs2_src/src/server/pvfs2-server.c:1652
> >> #15 <signal handler called>
> >> #16 0x000000000049a367 in PINT_manager_cancel (manager=0x13e56a70, 
> >> op_id=46912520503856) at ../pvfs2_src/src/common/mgmt/pint-mgmt.c:1162
> >> #17 0x0000000000487b29 in dbpf_bstream_direct_cancel (coll_id=445355253, 
> >> cancel_id=46912520503472, context_id=0)
> >>    at ../pvfs2_src/src/io/trove/trove-dbpf/dbpf-bstream-direct.c:1249
> >> #18 0x000000000047cbce in dbpf_dspace_cancel (coll_id=445355253, 
> >> id=46912520503472, context_id=0)
> >>    at ../pvfs2_src/src/io/trove/trove-dbpf/dbpf-dspace.c:1551
> >> #19 0x000000000043ed18 in trove_dspace_cancel (coll_id=445355253, 
> >> id=46912520503472, context_id=0) at ../pvfs2_src/src/io/trove/trove.c:1045
> >> #20 0x0000000000439a65 in PINT_thread_mgr_trove_cancel (id=46912520503472, 
> >> fs_id=445355253, user_ptr=0x2aaaac196670)
> >>    at ../pvfs2_src/src/io/job/thread-mgr.c:726
> >> #21 0x000000000047052f in cancel_pending_trove (list=0x2aaaac005e50, 
> >> coll_id=445355253)
> >>    at 
> >> ../pvfs2_src/src/io/flow/flowproto-bmi-trove/flowproto-multiqueue.c:2160
> >> #22 0x00000000004700af in handle_io_error (error_code=-1610613121, 
> >> q_item=0x0, flow_data=0x2aaaac005c00)
> >>    at 
> >> ../pvfs2_src/src/io/flow/flowproto-bmi-trove/flowproto-multiqueue.c:2039
> >> #23 0x000000000046bbd5 in fp_multiqueue_cancel (flow_d=0x2aaaac1bacf0) at 
> >> ../pvfs2_src/src/io/flow/flowproto-bmi-trove/flowproto-multiqueue.c:480
> >> #24 0x000000000043ae1f in PINT_flow_cancel (flow_d=0x2aaaac1bacf0) at 
> >> ../pvfs2_src/src/io/flow/flow.c:426
> >> #25 0x0000000000432706 in job_flow_cancel (id=123566, context_id=0) at 
> >> ../pvfs2_src/src/io/job/job.c:1395
> >> #26 0x000000000043a41f in job_time_mgr_expire () at 
> >> ../pvfs2_src/src/io/job/job-time-mgr.c:277
> >> #27 0x0000000000426af2 in job_timer_do_work (smcb=0x13fb78a0, 
> >> js_p=0x13e3ecb0) at ../pvfs2_src/src/server/job-timer.sm:67
> >> #28 0x0000000000464591 in PINT_state_machine_invoke (smcb=0x13fb78a0, 
> >> r=0x13e3ecb0) at ../pvfs2_src/src/common/misc/state-machine-fns.c:132
> >> #29 0x000000000046498c in PINT_state_machine_next (smcb=0x13fb78a0, 
> >> r=0x13e3ecb0) at ../pvfs2_src/src/common/misc/state-machine-fns.c:309
> >> #30 0x00000000004649cb in PINT_state_machine_continue (smcb=0x13fb78a0, 
> >> r=0x13e3ecb0) at ../pvfs2_src/src/common/misc/state-machine-fns.c:327
> >> #31 0x0000000000410ef2 in main (argc=6, argv=0x7ffff6a54f28) at 
> >> ../pvfs2_src/src/server/pvfs2-server.c:413
> >>
> >>
> >> From the messages you probably deduced that we are using directio. This 
> >> has happened twice now that I know of. It looks like the canellation path 
> >> might be responsible, but I would appreciate someone taking a look. Can I 
> >> provide any more information while the server is still in this state? I 
> >> will need to fix it soon.
> >>
> >> Thanks,
> >> Bart.
> >>
> >> _______________________________________________
> >> Pvfs2-developers mailing list
> >> [email protected]
> >> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
> >
> 
> 
> 


_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

Re: [Pvfs2-developers] server daemon hung

Reply via email to