Hey guys,
We have run into an issue a couple of times now where one of our 2.8 server
daemons hangs and cannot be killed. This happened again last night, and I
captured some information from it this time. This particular setup has 4 server
nodes running 2.6.18-194 kernels. Around the time heavy access started, there
were some client timeout messages and several of these in dmesg:
*****************************************************
PVFS2 Device Error: You cannot open the device file
/dev/pvfs2-req more than once. Please make sure that
there are no instances of a program using this device
currently running. (You must verify this!)
For example, you can use the lsof program as follows:
'lsof | grep pvfs2-req' (run this as root)
open_access_count = 1
*****************************************************
A pvfs2-ping hung on the second server, and I discovered this in syslog on
that server. The messages begin when the server was started yesterday and
end when the errors stopped.
Oct 19 14:39:52 server2 PVFS2: [E] PVFS2 server got signal 15
(server_status_flag: 507903)
Oct 19 14:39:53 server2 kernel: kjournald starting. Commit interval 5
seconds
Oct 19 14:39:53 server2 kernel: EXT3 FS on sdb, internal journal
Oct 19 14:39:53 server2 kernel: EXT3-fs: mounted filesystem with ordered
data mode.
Oct 19 14:39:58 server2 PVFS2: [D] PVFS2 Server version 2.8.2 starting.
Oct 19 17:32:57 server2 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling flow operation, job_id: 121080.
Oct 19 17:32:57 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel
called on 0x13f52d20
Oct 19 17:32:57 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
Oct 19 17:32:57 server2 PVFS2: [E] handle_io_error: flow proto error cleanup
started on 0x13f52d20: Operation cancelled (possibly due to timeout)
Oct 19 17:32:59 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed,
proceeding anyway.
Oct 19 17:32:59 server2 last message repeated 3 times
Oct 19 17:32:59 server2 PVFS2: [E] handle_io_error: flow proto 0x13f52d20
canceled 4 operations, will clean up.
Oct 19 17:32:59 server2 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling flow operation, job_id: 121085.
Oct 19 17:32:59 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel
called on 0x13f3c750
Oct 19 17:32:59 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
Oct 19 17:32:59 server2 PVFS2: [E] handle_io_error: flow proto error cleanup
started on 0x13f3c750: Operation cancelled (possibly due to timeout)
Oct 19 17:32:59 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed,
proceeding anyway.
Oct 19 17:32:59 server2 PVFS2: [E] trove_read_callback_fn: I/O error
occurred
Oct 19 17:33:00 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed,
proceeding anyway.
Oct 19 17:33:02 server2 PVFS2: [E] trove_read_callback_fn: I/O error
occurred
Oct 19 17:33:03 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed,
proceeding anyway.
Oct 19 17:33:03 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed,
proceeding anyway.
Oct 19 17:33:04 server2 PVFS2: [E] handle_io_error: flow proto 0x13f3c750
canceled 4 operations, will clean up.
Oct 19 17:33:06 server2 PVFS2: [E] trove_read_callback_fn: I/O error
occurred
Oct 19 17:33:07 server2 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling flow operation, job_id: 121979.
Oct 19 17:33:07 server2 PVFS2: [E] trove_read_callback_fn: I/O error
occurred
Oct 19 17:33:10 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel
called on 0x2aaaac1d80e0
Oct 19 17:33:11 server2 PVFS2: [E] handle_io_error: flow proto 0x13f52d20
error cleanup finished: Operation cancelled (possibly due to timeout)
Oct 19 17:33:11 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
Oct 19 17:33:11 server2 PVFS2: [E] handle_io_error: flow proto error cleanup
started on 0x2aaaac1d80e0: Operation cancelled (possibly due to timeout)
Oct 19 17:33:11 server2 PVFS2: [E] handle_io_error: flow proto
0x2aaaac1d80e0 canceled 1 operations, will clean up.
Oct 19 17:33:12 server2 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling flow operation, job_id: 122772.
Oct 19 17:33:12 server2 PVFS2: [E] trove_read_callback_fn: I/O error
occurred
Oct 19 17:33:12 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel
called on 0x2aaaac4b2800
Oct 19 17:33:12 server2 PVFS2: [E] handle_io_error: flow proto
0x2aaaac1d80e0 error cleanup finished: Operation cancelled (possibly due to
timeout)
Oct 19 17:33:14 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
Oct 19 17:33:15 server2 PVFS2: [E] handle_io_error: flow proto error cleanup
started on 0x2aaaac4b2800: Operation cancelled (possibly due to timeout)
Oct 19 17:33:15 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed,
proceeding anyway.
Oct 19 17:33:17 server2 last message repeated 3 times
Oct 19 17:33:17 server2 PVFS2: [E] handle_io_error: flow proto
0x2aaaac4b2800 canceled 4 operations, will clean up.
Oct 19 17:33:17 server2 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling flow operation, job_id: 122777.
Oct 19 17:33:17 server2 PVFS2: [E] trove_read_callback_fn: I/O error
occurred
Oct 19 17:33:17 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel
called on 0x2aaaac223d90
Oct 19 17:33:17 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
Oct 19 17:33:17 server2 PVFS2: [E] handle_io_error: flow proto error cleanup
started on 0x2aaaac223d90: Operation cancelled (possibly due to timeout)
Oct 19 17:33:17 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed,
proceeding anyway.
Oct 19 17:33:18 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed,
proceeding anyway.
Oct 19 17:33:18 server2 PVFS2: [E] handle_io_error: flow proto
0x2aaaac223d90 canceled 2 operations, will clean up.
Oct 19 17:33:18 server2 PVFS2: [E] trove_read_callback_fn: I/O error
occurred
Oct 19 17:33:18 server2 last message repeated 5 times
Oct 19 17:33:18 server2 PVFS2: [E] handle_io_error: flow proto 0x13f3c750
error cleanup finished: Operation cancelled (possibly due to timeout)
Oct 19 17:33:18 server2 PVFS2: [E] trove_read_callback_fn: I/O error
occurred
Oct 19 17:33:19 server2 PVFS2: [E] handle_io_error: flow proto
0x2aaaac4b2800 error cleanup finished: Operation cancelled (possibly due to
timeout)
Oct 19 17:33:19 server2 PVFS2: [E] trove_read_callback_fn: I/O error
occurred
Oct 19 17:33:19 server2 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling flow operation, job_id: 123566.
Oct 19 17:33:20 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel
called on 0x2aaaac1bacf0
Oct 19 17:33:20 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
Oct 19 17:33:20 server2 PVFS2: [E] handle_io_error: flow proto error cleanup
started on 0x2aaaac1bacf0: Operation cancelled (possibly due to timeout)
Oct 19 17:33:20 server2 PVFS2: [E] PINT_manager_find_worker: operation
posted with a queue id (30) that isn't held by this manager
I tried killing the server so that it could be restarted, but it cannot be
killed. I attached GDB and captured this backtrace:
Program received signal SIGINT, Interrupt.
0x0000003d70a0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) where
#0 0x0000003d70a0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003d70a08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x0000003d70a08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000452579 in gen_posix_mutex_lock (mut=0x6da360) at
../pvfs2_src/src/common/gen-locks/gen-locks.c:45
#4 0x0000000000431e67 in job_bmi_unexp_cancel (id=123372) at
../pvfs2_src/src/io/job/job.c:811
#5 0x0000000000413b06 in server_purge_unexpected_recv_machines () at
../pvfs2_src/src/server/pvfs2-server.c:1879
#6 0x000000000041340c in server_sig_handler (sig=15) at
../pvfs2_src/src/server/pvfs2-server.c:1652
#7 <signal handler called>
#8 0x0000003d70a0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#9 0x0000003d70a08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#10 0x0000003d70a08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#11 0x0000000000452579 in gen_posix_mutex_lock (mut=0x6da360) at
../pvfs2_src/src/common/gen-locks/gen-locks.c:45
#12 0x0000000000431e67 in job_bmi_unexp_cancel (id=123372) at
../pvfs2_src/src/io/job/job.c:811
#13 0x0000000000413b06 in server_purge_unexpected_recv_machines () at
../pvfs2_src/src/server/pvfs2-server.c:1879
#14 0x000000000041340c in server_sig_handler (sig=11) at
../pvfs2_src/src/server/pvfs2-server.c:1652
#15 <signal handler called>
#16 0x000000000049a367 in PINT_manager_cancel (manager=0x13e56a70,
op_id=46912520503856) at ../pvfs2_src/src/common/mgmt/pint-mgmt.c:1162
#17 0x0000000000487b29 in dbpf_bstream_direct_cancel (coll_id=445355253,
cancel_id=46912520503472, context_id=0)
at ../pvfs2_src/src/io/trove/trove-dbpf/dbpf-bstream-direct.c:1249
#18 0x000000000047cbce in dbpf_dspace_cancel (coll_id=445355253,
id=46912520503472, context_id=0)
at ../pvfs2_src/src/io/trove/trove-dbpf/dbpf-dspace.c:1551
#19 0x000000000043ed18 in trove_dspace_cancel (coll_id=445355253,
id=46912520503472, context_id=0) at ../pvfs2_src/src/io/trove/trove.c:1045
#20 0x0000000000439a65 in PINT_thread_mgr_trove_cancel (id=46912520503472,
fs_id=445355253, user_ptr=0x2aaaac196670)
at ../pvfs2_src/src/io/job/thread-mgr.c:726
#21 0x000000000047052f in cancel_pending_trove (list=0x2aaaac005e50,
coll_id=445355253)
at
../pvfs2_src/src/io/flow/flowproto-bmi-trove/flowproto-multiqueue.c:2160
#22 0x00000000004700af in handle_io_error (error_code=-1610613121,
q_item=0x0, flow_data=0x2aaaac005c00)
at
../pvfs2_src/src/io/flow/flowproto-bmi-trove/flowproto-multiqueue.c:2039
#23 0x000000000046bbd5 in fp_multiqueue_cancel (flow_d=0x2aaaac1bacf0) at
../pvfs2_src/src/io/flow/flowproto-bmi-trove/flowproto-multiqueue.c:480
#24 0x000000000043ae1f in PINT_flow_cancel (flow_d=0x2aaaac1bacf0) at
../pvfs2_src/src/io/flow/flow.c:426
#25 0x0000000000432706 in job_flow_cancel (id=123566, context_id=0) at
../pvfs2_src/src/io/job/job.c:1395
#26 0x000000000043a41f in job_time_mgr_expire () at
../pvfs2_src/src/io/job/job-time-mgr.c:277
#27 0x0000000000426af2 in job_timer_do_work (smcb=0x13fb78a0,
js_p=0x13e3ecb0) at ../pvfs2_src/src/server/job-timer.sm:67
#28 0x0000000000464591 in PINT_state_machine_invoke (smcb=0x13fb78a0,
r=0x13e3ecb0) at ../pvfs2_src/src/common/misc/state-machine-fns.c:132
#29 0x000000000046498c in PINT_state_machine_next (smcb=0x13fb78a0,
r=0x13e3ecb0) at ../pvfs2_src/src/common/misc/state-machine-fns.c:309
#30 0x00000000004649cb in PINT_state_machine_continue (smcb=0x13fb78a0,
r=0x13e3ecb0) at ../pvfs2_src/src/common/misc/state-machine-fns.c:327
#31 0x0000000000410ef2 in main (argc=6, argv=0x7ffff6a54f28) at
../pvfs2_src/src/server/pvfs2-server.c:413
From the messages you have probably deduced that we are using directio. This
has happened twice now that I know of. It looks like the cancellation path
might be responsible, but I would appreciate someone taking a look. Can I
provide any more information while the server is still in this state? I will
need to fix it soon.
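If it helps, here is a minimal standalone sketch (plain C, not PVFS2 code; the
mutex and handler names are made up) of the pattern the backtrace seems to
show: the signal handler ends up in job_bmi_unexp_cancel() calling
pthread_mutex_lock() by way of gen_posix_mutex_lock() on a mutex (0x6da360 in
frames #3 and #11) that the interrupted thread apparently already holds, so
the handler never returns and the daemon cannot be killed:

#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-in for the job queue mutex; in the backtrace,
 * gen_posix_mutex_lock() is a thin wrapper around pthread_mutex_lock().
 * This is a default (non-recursive) mutex. */
static pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for server_sig_handler() -> job_bmi_unexp_cancel(): the handler
 * runs on whatever thread was interrupted and takes the same mutex.
 * pthread_mutex_lock() is not async-signal-safe, which is the core problem. */
static void term_handler(int sig)
{
    (void)sig;
    pthread_mutex_lock(&queue_mutex);   /* never returns if this thread holds it */
    /* ... purge unexpected receives ... */
    pthread_mutex_unlock(&queue_mutex);
}

int main(void)
{
    signal(SIGTERM, term_handler);

    pthread_mutex_lock(&queue_mutex);   /* e.g. the cancellation path holds the lock */
    printf("holding the lock in pid %d; kill it now and it hangs\n", (int)getpid());
    sleep(600);                         /* SIGTERM delivered here deadlocks in the handler */
    pthread_mutex_unlock(&queue_mutex);
    return 0;
}

The same wait shows up twice in the backtrace, once for signal 11 (frames
#11-#14) and again for signal 15 (frames #3-#6) nested on top of it, while the
flow cancellation path is still on the stack underneath (frames #16-#31).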
Thanks,
Bart.