I think the reason the operation id appears invalid is simply that sometimes the operation completes before the cancel path gets a chance to cancel it. Attached is another patch (to be applied on top of the previous one) that does better checking of whether or not the operation still exists (using id_gen_safe_register instead of id_gen_fast_register), and keeps the operation pointer around until it can actually be removed by the call to test_all. This should prevent unsafe accesses of already-freed operations and perform the correct logic if the operation completes before the cancel is called.
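To make the intent concrete, here is a rough sketch of the approach (not the patch itself): the managed_op struct and the cancel function below are made-up placeholders, and the id_gen_safe_* prototypes are my reading of the id-generator interface, so check them against id-generator.h.

/* Minimal sketch of the cancel-path idea: register operations with the
 * "safe" id generator so a lookup after the operation has completed and
 * been freed returns NULL rather than a dangling pointer.  Types and the
 * cancel function are placeholders; only the id_gen_safe_* names come
 * from src/common/id-generator. */

#include <stddef.h>
#include <stdint.h>

typedef int64_t PVFS_id_gen_t;

/* assumed id-generator interface (check against id-generator.h) */
extern int id_gen_safe_register(PVFS_id_gen_t *new_id, void *item);
extern void *id_gen_safe_lookup(PVFS_id_gen_t id);
extern int id_gen_safe_unregister(PVFS_id_gen_t id);

struct managed_op            /* hypothetical operation record */
{
    int completed;           /* set when the operation finishes */
    /* ... real fields ... */
};

int manager_cancel_sketch(PVFS_id_gen_t op_id)
{
    /* NULL means the op already completed and was unregistered by
     * test_all, so there is nothing left to cancel */
    struct managed_op *op = id_gen_safe_lookup(op_id);
    if (op == NULL)
    {
        return 0;
    }

    if (op->completed)
    {
        /* completed but not yet reaped: keep the pointer alive and
         * let test_all unregister and free it later */
        return 0;
    }

    /* ... perform the actual cancel against op here ... */
    return 0;
}

The point is that a lookup on the id of an operation that has already completed and been reaped comes back NULL instead of a pointer into freed memory, and a completed-but-unreaped operation stays allocated until test_all unregisters and frees it.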
-sam
pint-mgmt-opid-cleanups.patch
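For reference, the earlier segv fix (point 3 of the quoted message below) boils down to checking the result of the worker lookup before dereferencing it. A rough sketch of that kind of check, using placeholder names rather than the actual pint-mgmt.c symbols:

/* Sketch of the missing error check from point 3 below; names with a
 * _sketch suffix are placeholders, not the real pint-mgmt.c functions. */

#include <stddef.h>

struct worker;   /* opaque placeholder for the real worker type */

/* assumed: fills *worker and returns 0 on success, or a negative
 * error code when the op id is not held by this manager */
extern int find_worker_sketch(long op_id, struct worker **worker);

int manager_cancel_checked(long op_id)
{
    struct worker *w = NULL;
    int ret = find_worker_sketch(op_id, &w);
    if (ret < 0 || w == NULL)
    {
        /* before the fix this error was ignored and the worker pointer
         * was dereferenced anyway, which is the reported segfault */
        return ret;
    }

    /* ... continue the cancel path through the worker ... */
    return 0;
}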
On Oct 20, 2010, at 10:19 AM, Sam Lang wrote:

> 
> Hi Bart,
> 
> Attached is a patch that should fix the segv. It looks like there might be multiple bugs causing this problem. Working from the top down:
> 
> 1. I/O operations at the server are timing out. This could just be due to server load or timeout values that are too small, depending on the storage hardware involved.
> 
> 2. The server is trying to cancel I/O operations that have timed out, and is following the cancel path down into the pint-mgmt code (PINT_manager_cancel). The operation id given to that function is invalid, so PINT_manager_find_worker (called by PINT_manager_cancel) can't find the associated component to continue the cancel path, and returns an error code.
> 
> 3. PINT_manager_cancel fails to check the error code, so the normal code path is followed and the segfault occurs by dereferencing the worker pointer. The attached patch should fix this segfault, but won't fix the issue that the operation id appears to be invalid, so the I/O operations won't get cancelled. The server will continue to run, but if you keep getting timeouts you will accumulate I/O operations at that server.
> 
> 4. Without the patch, the segv causes the PVFS server signal handler to get invoked, which tries to clean up and shut down the server. While trying to cancel the bmi unexpected messages, it tries to lock the job completion_mutex, which must have been locked already by the trove completion thread or by the cancel path described above. I think that once the segfault is fixed, the mutex will eventually get unlocked, and you will be able to signal the server to shut down properly.
> 
> 5. Right now each kill signal is getting caught by the signal handler, which eventually tries to lock that same completion_mutex.
> 
> I'm not sure why you see those device error messages on the client, but I think that's a separate problem.
> 
> Are you directing the pvfs2-client log output to the system log? If not, you might check the pvfs2-client log file.
> 
> There should be multiple threads running on the server. In gdb you can do 'thread apply all bt' to get the backtrace of all running threads.
> 
> -sam
> 
> <pint-mgmt-segv-fix.patch>
> 
> On Oct 20, 2010, at 9:22 AM, Bart Taylor wrote:
> 
>> 
>> Hey guys,
>> 
>> We have run into an issue a couple of times now where one of our 2.8 server daemons will hang and cannot be killed. This happened again last night and I captured some info from it this time. This particular case has 4 server nodes running 2.6.18-194 kernels. About the time a lot of access started, there were some client timeout messages and several of these in dmesg:
>> 
>> *****************************************************
>> PVFS2 Device Error: You cannot open the device file
>> /dev/pvfs2-req more than once. Please make sure that
>> there are no instances of a program using this device
>> currently running. (You must verify this!)
>> For example, you can use the lsof program as follows:
>> 'lsof | grep pvfs2-req' (run this as root)
>> open_access_count = 1
>> *****************************************************
>> 
>> A pvfs2-ping hung on the second server, and I discovered this in syslog on that server. The messages begin when the server was started yesterday and end when the errors stopped.
>> 
>> Oct 19 14:39:52 server2 PVFS2: [E] PVFS2 server got signal 15 (server_status_flag: 507903)
>> Oct 19 14:39:53 server2 kernel: kjournald starting. Commit interval 5 seconds
>> Oct 19 14:39:53 server2 kernel: EXT3 FS on sdb, internal journal
>> Oct 19 14:39:53 server2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
>> Oct 19 14:39:58 server2 PVFS2: [D] PVFS2 Server version 2.8.2 starting.
>> Oct 19 17:32:57 server2 PVFS2: [E] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 121080.
>> Oct 19 17:32:57 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel called on 0x13f52d20
>> Oct 19 17:32:57 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
>> Oct 19 17:32:57 server2 PVFS2: [E] handle_io_error: flow proto error cleanup started on 0x13f52d20: Operation cancelled (possibly due to timeout)
>> Oct 19 17:32:59 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed, proceeding anyway.
>> Oct 19 17:32:59 server2 last message repeated 3 times
>> Oct 19 17:32:59 server2 PVFS2: [E] handle_io_error: flow proto 0x13f52d20 canceled 4 operations, will clean up.
>> Oct 19 17:32:59 server2 PVFS2: [E] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 121085.
>> Oct 19 17:32:59 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel called on 0x13f3c750
>> Oct 19 17:32:59 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
>> Oct 19 17:32:59 server2 PVFS2: [E] handle_io_error: flow proto error cleanup started on 0x13f3c750: Operation cancelled (possibly due to timeout)
>> Oct 19 17:32:59 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed, proceeding anyway.
>> Oct 19 17:32:59 server2 PVFS2: [E] trove_read_callback_fn: I/O error occurred
>> Oct 19 17:33:00 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed, proceeding anyway.
>> Oct 19 17:33:02 server2 PVFS2: [E] trove_read_callback_fn: I/O error occurred
>> Oct 19 17:33:03 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed, proceeding anyway.
>> Oct 19 17:33:03 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed, proceeding anyway.
>> Oct 19 17:33:04 server2 PVFS2: [E] handle_io_error: flow proto 0x13f3c750 canceled 4 operations, will clean up.
>> Oct 19 17:33:06 server2 PVFS2: [E] trove_read_callback_fn: I/O error occurred
>> Oct 19 17:33:07 server2 PVFS2: [E] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 121979.
>> Oct 19 17:33:07 server2 PVFS2: [E] trove_read_callback_fn: I/O error occurred
>> Oct 19 17:33:10 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac1d80e0
>> Oct 19 17:33:11 server2 PVFS2: [E] handle_io_error: flow proto 0x13f52d20 error cleanup finished: Operation cancelled (possibly due to timeout)
>> Oct 19 17:33:11 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
>> Oct 19 17:33:11 server2 PVFS2: [E] handle_io_error: flow proto error cleanup started on 0x2aaaac1d80e0: Operation cancelled (possibly due to timeout)
>> Oct 19 17:33:11 server2 PVFS2: [E] handle_io_error: flow proto 0x2aaaac1d80e0 canceled 1 operations, will clean up.
>> Oct 19 17:33:12 server2 PVFS2: [E] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 122772.
>> Oct 19 17:33:12 server2 PVFS2: [E] trove_read_callback_fn: I/O error occurred
>> Oct 19 17:33:12 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac4b2800
>> Oct 19 17:33:12 server2 PVFS2: [E] handle_io_error: flow proto 0x2aaaac1d80e0 error cleanup finished: Operation cancelled (possibly due to timeout)
>> Oct 19 17:33:14 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
>> Oct 19 17:33:15 server2 PVFS2: [E] handle_io_error: flow proto error cleanup started on 0x2aaaac4b2800: Operation cancelled (possibly due to timeout)
>> Oct 19 17:33:15 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed, proceeding anyway.
>> Oct 19 17:33:17 server2 last message repeated 3 times
>> Oct 19 17:33:17 server2 PVFS2: [E] handle_io_error: flow proto 0x2aaaac4b2800 canceled 4 operations, will clean up.
>> Oct 19 17:33:17 server2 PVFS2: [E] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 122777.
>> Oct 19 17:33:17 server2 PVFS2: [E] trove_read_callback_fn: I/O error occurred
>> Oct 19 17:33:17 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac223d90
>> Oct 19 17:33:17 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
>> Oct 19 17:33:17 server2 PVFS2: [E] handle_io_error: flow proto error cleanup started on 0x2aaaac223d90: Operation cancelled (possibly due to timeout)
>> Oct 19 17:33:17 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed, proceeding anyway.
>> Oct 19 17:33:18 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed, proceeding anyway.
>> Oct 19 17:33:18 server2 PVFS2: [E] handle_io_error: flow proto 0x2aaaac223d90 canceled 2 operations, will clean up.
>> Oct 19 17:33:18 server2 PVFS2: [E] trove_read_callback_fn: I/O error occurred
>> Oct 19 17:33:18 server2 last message repeated 5 times
>> Oct 19 17:33:18 server2 PVFS2: [E] handle_io_error: flow proto 0x13f3c750 error cleanup finished: Operation cancelled (possibly due to timeout)
>> Oct 19 17:33:18 server2 PVFS2: [E] trove_read_callback_fn: I/O error occurred
>> Oct 19 17:33:19 server2 PVFS2: [E] handle_io_error: flow proto 0x2aaaac4b2800 error cleanup finished: Operation cancelled (possibly due to timeout)
>> Oct 19 17:33:19 server2 PVFS2: [E] trove_read_callback_fn: I/O error occurred
>> Oct 19 17:33:19 server2 PVFS2: [E] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 123566.
>> Oct 19 17:33:20 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel called on 0x2aaaac1bacf0
>> Oct 19 17:33:20 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
>> Oct 19 17:33:20 server2 PVFS2: [E] handle_io_error: flow proto error cleanup started on 0x2aaaac1bacf0: Operation cancelled (possibly due to timeout)
>> Oct 19 17:33:20 server2 PVFS2: [E] PINT_manager_find_worker: operation posted with a queue id (30) that isn't held by this manager
>> 
>> I tried killing the server to get it restarted, but it cannot be killed. I attached GDB and found this:
>> 
>> Program received signal SIGINT, Interrupt.
>> 0x0000003d70a0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
>> (gdb) where
>> #0 0x0000003d70a0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
>> #1 0x0000003d70a08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
>> #2 0x0000003d70a08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
>> #3 0x0000000000452579 in gen_posix_mutex_lock (mut=0x6da360) at ../pvfs2_src/src/common/gen-locks/gen-locks.c:45
>> #4 0x0000000000431e67 in job_bmi_unexp_cancel (id=123372) at ../pvfs2_src/src/io/job/job.c:811
>> #5 0x0000000000413b06 in server_purge_unexpected_recv_machines () at ../pvfs2_src/src/server/pvfs2-server.c:1879
>> #6 0x000000000041340c in server_sig_handler (sig=15) at ../pvfs2_src/src/server/pvfs2-server.c:1652
>> #7 <signal handler called>
>> #8 0x0000003d70a0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
>> #9 0x0000003d70a08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
>> #10 0x0000003d70a08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
>> #11 0x0000000000452579 in gen_posix_mutex_lock (mut=0x6da360) at ../pvfs2_src/src/common/gen-locks/gen-locks.c:45
>> #12 0x0000000000431e67 in job_bmi_unexp_cancel (id=123372) at ../pvfs2_src/src/io/job/job.c:811
>> #13 0x0000000000413b06 in server_purge_unexpected_recv_machines () at ../pvfs2_src/src/server/pvfs2-server.c:1879
>> #14 0x000000000041340c in server_sig_handler (sig=11) at ../pvfs2_src/src/server/pvfs2-server.c:1652
>> #15 <signal handler called>
>> #16 0x000000000049a367 in PINT_manager_cancel (manager=0x13e56a70, op_id=46912520503856) at ../pvfs2_src/src/common/mgmt/pint-mgmt.c:1162
>> #17 0x0000000000487b29 in dbpf_bstream_direct_cancel (coll_id=445355253, cancel_id=46912520503472, context_id=0) at ../pvfs2_src/src/io/trove/trove-dbpf/dbpf-bstream-direct.c:1249
>> #18 0x000000000047cbce in dbpf_dspace_cancel (coll_id=445355253, id=46912520503472, context_id=0) at ../pvfs2_src/src/io/trove/trove-dbpf/dbpf-dspace.c:1551
>> #19 0x000000000043ed18 in trove_dspace_cancel (coll_id=445355253, id=46912520503472, context_id=0) at ../pvfs2_src/src/io/trove/trove.c:1045
>> #20 0x0000000000439a65 in PINT_thread_mgr_trove_cancel (id=46912520503472, fs_id=445355253, user_ptr=0x2aaaac196670) at ../pvfs2_src/src/io/job/thread-mgr.c:726
>> #21 0x000000000047052f in cancel_pending_trove (list=0x2aaaac005e50, coll_id=445355253) at ../pvfs2_src/src/io/flow/flowproto-bmi-trove/flowproto-multiqueue.c:2160
>> #22 0x00000000004700af in handle_io_error (error_code=-1610613121, q_item=0x0, flow_data=0x2aaaac005c00) at ../pvfs2_src/src/io/flow/flowproto-bmi-trove/flowproto-multiqueue.c:2039
>> #23 0x000000000046bbd5 in fp_multiqueue_cancel (flow_d=0x2aaaac1bacf0) at ../pvfs2_src/src/io/flow/flowproto-bmi-trove/flowproto-multiqueue.c:480
>> #24 0x000000000043ae1f in PINT_flow_cancel (flow_d=0x2aaaac1bacf0) at ../pvfs2_src/src/io/flow/flow.c:426
>> #25 0x0000000000432706 in job_flow_cancel (id=123566, context_id=0) at ../pvfs2_src/src/io/job/job.c:1395
>> #26 0x000000000043a41f in job_time_mgr_expire () at ../pvfs2_src/src/io/job/job-time-mgr.c:277
>> #27 0x0000000000426af2 in job_timer_do_work (smcb=0x13fb78a0, js_p=0x13e3ecb0) at ../pvfs2_src/src/server/job-timer.sm:67
>> #28 0x0000000000464591 in PINT_state_machine_invoke (smcb=0x13fb78a0, r=0x13e3ecb0) at ../pvfs2_src/src/common/misc/state-machine-fns.c:132
>> #29 0x000000000046498c in PINT_state_machine_next (smcb=0x13fb78a0, r=0x13e3ecb0) at ../pvfs2_src/src/common/misc/state-machine-fns.c:309
>> #30 0x00000000004649cb in PINT_state_machine_continue (smcb=0x13fb78a0, r=0x13e3ecb0) at ../pvfs2_src/src/common/misc/state-machine-fns.c:327
>> #31 0x0000000000410ef2 in main (argc=6, argv=0x7ffff6a54f28) at ../pvfs2_src/src/server/pvfs2-server.c:413
>> 
>> From the messages you probably deduced that we are using directio. This has happened twice now that I know of. It looks like the cancellation path might be responsible, but I would appreciate someone taking a look. Can I provide any more information while the server is still in this state? I will need to fix it soon.
>> 
>> Thanks,
>> Bart.
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
