Hi, I have a problem that looks similar to the "possible deadlock in lrmd" issue reported on 21st Oct.
When running lrmadmin -C to list the RA classes, the first time it comes back immediately with the expected list, e.g.:

root@node1:/home# lrmadmin -C
There are 5 RA classes supported:
lsb ocf stonith upstart heartbeat

All subsequent attempts hang and never come back (you have to kill lrmadmin with Ctrl-C). This is repeatable on all the machines I have tried it on. A reboot appears to be the only cure, as "corosync stop" baulks at "Waiting for corosync services to unload:.........".

The environment is compiled from source (as per http://www.clusterlabs.org/wiki/Install#From_Source) on a clean Maverick server (both 32 and 64 bit). Versions are:

Cluster-Resource-Agents-051972b5cfd
Pacemaker-1-0-b2e39d318fda
Reusable-Cluster-Components-8658bcdd4511
flatiron - not sure which, but downloaded Friday 19th

strace suggests that lrmadmin is stuck on /var/run/heartbeat/lrm_cmd_sock: its reads keep reporting "Resource temporarily unavailable" and the outbound registration message never gets a response:

17:43:41.328500 connect(3, {sa_family=AF_FILE, path="/var/run/heartbeat/lrm_cmd_sock"}, 110) = 0
17:43:41.328572 getsockopt(3, SOL_SOCKET, SO_PEERCRED, "\t\4\0\0\0\0\0\0\0\0\0\0", [12]) = 0
17:43:41.328788 getegid() = 0
17:43:41.328846 getuid() = 0
17:43:41.328970 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
17:43:41.329050 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329154 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
17:43:41.329202 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329263 sendto(3, "F\0\0\0\315\253\0\0>>>\nlrm_t=reg\nlrm_app=lr"..., 78, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 78
17:43:41.329337 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
17:43:41.329380 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329420 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
17:43:41.329458 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329497 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
17:43:41.329535 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329574 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
17:43:41.329613 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329651 poll([{fd=3, events=POLLIN}], 1, -1 <unfinished ...>

The lrmd process is still alive and there is nothing logged in /var/log/daemon.log. Its strace implies it never even saw the request on the socket. The process still has 3 file handles open on the socket:

root@node1:~# lsof /var/run/heartbeat/lrm_cmd_sock
COMMAND  PID USER  FD  TYPE DEVICE             SIZE/OFF NODE  NAME
lrmd    1420 root  3u  unix 0xffff88001e011040 0t0      8732  /var/run/heartbeat/lrm_cmd_sock
lrmd    1420 root  9u  unix 0xffff88001e0b4d00 0t0      8782  /var/run/heartbeat/lrm_cmd_sock
lrmd    1420 root  11u unix 0xffff88001e1a9d40 0t0      10211 /var/run/heartbeat/lrm_cmd_sock
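For what it's worth, the connect() succeeding while nothing is ever read back is consistent with the main loop being wedged: for a UNIX stream socket the kernel completes the connection out of the listen backlog, but accept() and the subsequent reads only happen when the daemon's main loop dispatches the watch on the listening socket. A rough sketch of that general shape (generic GLib code with my own made-up names and path, not lrmd's actual plumbing, which goes through the heartbeat IPC library; error handling omitted):

/*
 * Generic sketch, NOT lrmd's code.  It only illustrates the shape:
 * accept() and all reads happen from main-loop callbacks, so a wedged
 * loop still lets clients connect() (backlog) but never answers them.
 */
#include <glib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static gboolean
on_cmd_sock_ready(GIOChannel *ch, GIOCondition cond, gpointer data)
{
    int listen_fd = g_io_channel_unix_get_fd(ch);
    int client_fd = accept(listen_fd, NULL, NULL);

    if (client_fd >= 0) {
        /* register the client, add a watch for its requests, ... */
        close(client_fd);              /* placeholder */
    }
    return TRUE;                       /* keep watching the listener */
}

int
main(void)
{
    const char *path = "/tmp/example_cmd_sock";   /* made-up path */
    struct sockaddr_un addr;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    unlink(path);
    bind(fd, (struct sockaddr *) &addr, sizeof(addr));
    listen(fd, 10);

    g_io_add_watch(g_io_channel_unix_new(fd), G_IO_IN,
                   on_cmd_sock_ready, NULL);

    g_main_loop_run(g_main_loop_new(NULL, FALSE));
    return 0;
}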
A good strace (i.e. lrmadmin -C straight after a reboot) starts identically to the one above, but receives a response from lrmd:

...
20:12:48.774239 poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
20:12:48.774603 recvfrom(3, " \0\0\0\315\253\0\0>>>\nlrm_t=return\nlrm_ret"..., 4048, MSG_DONTWAIT, NULL, NULL) = 40
20:12:48.774661 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
20:12:48.774709 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
20:12:48.774756 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
20:12:48.774804 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
20:12:48.774851 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
20:12:48.774898 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
20:12:48.774945 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
20:12:48.775161 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
20:12:48.775210 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
20:12:48.775257 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
20:12:48.775304 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
20:12:48.775444 socket(PF_FILE, SOCK_STREAM, 0) = 4
20:12:48.775610 fcntl(4, F_GETFL) = 0x2 (flags O_RDWR)
20:12:48.775686 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
20:12:48.775841 connect(4, {sa_family=AF_FILE, path="/var/run/heartbeat/lrm_callback_sock"}, 110) = 0
20:12:48.775907 getsockopt(4, SOL_SOCKET, SO_PEERCRED, "\214\5\0\0\0\0\0\0\0\0\0\0", [12]) = 0
...

Other commands like "crm configure verify" exhibit the same hang, although I have not traced these. I guess they must use lrmd too.

You can throw a SIGUSR1 at lrmd to get it to print some debug (see the sketch after the dump below). This is what was shown just after I executed the first lrmadmin -C:

Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done: BEGIN resource dump
Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done: resource vsftpd/vsftpd/upstart/<null>
Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done: rsc->op_list...
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->op_list: lrmd_op: operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monitor] cr
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->op_list: lrmd_op2: rt_tag: 0, interval: 5000, delay: 0
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->op_list: lrmd_op3: t_recv: 210940ms, t_add: 20ms, t_perform: 20ms, t_done: 0ms, t_rcchange: 210940ms
Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done: rsc->repeat_op_list...
Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done: rsc->last_op_done...
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_done: lrmd_op: operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monito
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_done: lrmd_op2: rt_tag: 0, interval: 5000, delay: 0
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_done: lrmd_op3: t_recv: 210940ms, t_add: 5040ms, t_perform: 5040ms, t_done: 5020ms, t_rcchange: 210940ms
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op: operation monitor[5] on upstart::vsftpd::vsftpd for client 1041, its parameters: crm_feature_set=[3.0.
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op2: rt_tag: 0, interval: 0, delay: 0
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op3: t_recv: 214510ms, t_add: 214510ms, t_perform: 214510ms, t_done: 214470ms, t_rcchange: 214510ms
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op: operation stop[8] on upstart::vsftpd::vsftpd for client 1041, its parameters: crm_feature_set=[3.0.1]
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op2: rt_tag: 0, interval: 0, delay: 0
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op3: t_recv: 213210ms, t_add: 213210ms, t_perform: 213210ms, t_done: 213180ms, t_rcchange: 213210ms
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op: operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monit
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op2: rt_tag: 0, interval: 5000, delay: 0
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op3: t_recv: 210940ms, t_add: 5040ms, t_perform: 5040ms, t_done: 5020ms, t_rcchange: 210940ms
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op: operation start[19] on upstart::vsftpd::vsftpd for client 1041, its parameters: crm_feature_set=[3.0.1
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op2: rt_tag: 0, interval: 0, delay: 0
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op3: t_recv: 210990ms, t_add: 210990ms, t_perform: 210990ms, t_done: 210960ms, t_rcchange: 210990ms
Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done: END resource dump
Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done:operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monitor] crm_feature_se
Nov 22 21:12:39 node1 lrmd: [1038]: debug: lrmd_op_destroy: free the operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monitor]
Nov 22 21:12:39 node1 lrmd: [1038]: debug: replace_last_op: replace last op monitor_5000 for client crmd
Nov 22 21:12:39 node1 lrmd: [1038]: debug: lrmd_op_destroy: free the operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monitor]
Nov 22 21:12:39 node1 lrmd: [1038]: debug: to_repeatlist: repeat operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monitor] crm
Nov 22 21:12:39 node1 lrmd: [1038]: debug: lrmd_op_destroy: free the operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monitor]
Nov 22 21:12:39 node1 lrmd: [1038]: debug: perform_op: no op to perform?
Nov 22 21:12:42 node1 lrmd: [1038]: debug: G_main_IPC_Channel_constructor(sock=11,11)
Nov 22 21:12:42 node1 lrmd: [1038]: debug: dumping request: >>>#012lrm_t=reg#012lrm_app=lrmadmin#012lrm_pid=2438#012lrm_gid=0#012lrm_uid=0#012<<<
Nov 22 21:12:42 node1 lrmd: [1038]: debug: on_msg_register:client lrmadmin [2438] registered
Nov 22 21:12:42 node1 lrmd: [1038]: debug: G_main_IPC_Channel_constructor(sock=12,12)
Nov 22 21:12:42 node1 lrmd: [1038]: debug: dumping request: >>>#012lrm_t=rclasses#012<<<
Nov 22 21:12:42 node1 lrmd: [1038]: debug: on_msg_get_rsc_classes:client [2438] wants to get rsc classes
Nov 22 21:12:42 node1 lrmd: [1038]: debug: on_receive_cmd: the IPC to client [pid:2438] disconnected.
Nov 22 21:12:42 node1 lrmd: [1038]: debug: G_CH_destroy_int(chp=0x25e5830, sock=11) {
Nov 22 21:12:42 node1 lrmd: [1038]: debug: G_CH_destroy_int: Calling dnotify(sock=11, arg=0x25ea310) function
Nov 22 21:12:42 node1 lrmd: [1038]: debug: G_main_del_IPC_Channel(sock=11)
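As an aside on the SIGUSR1 trick: assuming lrmd dispatches the dump from its main loop rather than from the signal handler itself (which the behaviour described next suggests), a wedged loop will also silence this debug facility, because the handler can still record the request but nothing ever acts on it. A generic sketch of that dump-on-signal pattern (my own illustration, not lrmd's actual signal handling):

/*
 * Generic dump-on-SIGUSR1 sketch, not lrmd's code.  The handler is
 * async-signal-safe and only sets a flag; the real dump is done from a
 * main-loop source, so if the loop is stuck the flag is set but
 * nothing is ever printed.
 */
#include <glib.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t dump_requested = 0;

static void
on_sigusr1(int sig)
{
    dump_requested = 1;        /* just record the request */
}

static gboolean
check_dump_request(gpointer data)
{
    if (dump_requested) {
        dump_requested = 0;
        /* walk the resource/operation tables and log them here */
        g_message("SIGUSR1 received: dumping internal state");
    }
    return TRUE;               /* keep this timeout source installed */
}

int
main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigusr1;
    sigaction(SIGUSR1, &sa, NULL);

    g_timeout_add(1000, check_dump_request, NULL);   /* poll once a second */
    g_main_loop_run(g_main_loop_new(NULL, FALSE));
    return 0;
}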
lrmd doesn't respond to subsequent signal requests, so I cannot get any further debug using this technique.

A backtrace from gdb shows lrmd to be stuck in a lock wait:

#0  0x00007f7e5f8ba6b4 in __lll_lock_wait () from /lib/libpthread.so.0
#1  0x00007f7e5f8b5849 in _L_lock_953 () from /lib/libpthread.so.0
#2  0x00007f7e5f8b566b in pthread_mutex_lock () from /lib/libpthread.so.0
#3  0x00007f7e601b0806 in g_main_context_find_source_by_id () from /lib/libglib-2.0.so.0
#4  0x00007f7e601b08fe in g_source_remove () from /lib/libglib-2.0.so.0
#5  0x00007f7e61568ba1 in G_main_del_IPC_Channel (chp=0x11deed0) at GSource.c:495
#6  0x00000000004065a1 in on_remove_client (user_data=0x11df8e0) at lrmd.c:1526
#7  0x00007f7e615694ca in G_CH_destroy_int (source=0x11deed0) at GSource.c:675
#8  0x00007f7e601adc11 in ?? () from /lib/libglib-2.0.so.0
#9  0x00007f7e601ae428 in g_main_context_dispatch () from /lib/libglib-2.0.so.0
#10 0x00007f7e601b22a8 in ?? () from /lib/libglib-2.0.so.0
#11 0x00007f7e601b27b5 in g_main_loop_run () from /lib/libglib-2.0.so.0
#12 0x0000000000405d32 in init_start () at lrmd.c:1267
#13 0x0000000000404f7a in main (argc=1, argv=0x7fff91e24478) at lrmd.c:835

Looking at the source code, what seems to happen is that when the lrmadmin client disconnects from lrmd's cmd socket (having got what it needs), lrmd is left to tidy up by deleting the client's event source from the GLib GMainContext loop. It is in the process of calling g_source_remove(), which then hangs deep inside GLib on a mutex lock. On the surface the overall sequence makes sense, but the hang doesn't, and clearly shouldn't happen. I am at a loss as to whether it is a GLib issue (unlikely, I would have thought?) or an lrmd bug. IMHO lrmd should NEVER hang!
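Reading the backtrace at face value, g_source_remove() is trying to take a GLib context lock that is apparently already held further up the same stack: the client teardown is re-entering the main context from inside a source destroy path dispatched by g_main_context_dispatch(). One common way to sidestep that kind of re-entrancy, offered purely as a hypothetical sketch (the names and fields below are made up, and I have not checked how the heartbeat IPC wrappers actually own their sources), is to defer the removal of the client's remaining sources to an idle callback:

/*
 * Hypothetical sketch only, not lrmd's actual client bookkeeping.
 * Instead of removing a client's remaining sources from inside another
 * source's destroy notify (where, per the backtrace, GLib's context
 * lock appears to be already held), defer the teardown to an idle
 * callback that runs from a normal dispatch.
 */
#include <glib.h>

typedef struct {
    guint callback_source_id;   /* made-up field: the client's other source */
    /* ... rest of the per-client state ... */
} client_t;

static gboolean
deferred_client_teardown(gpointer data)
{
    client_t *client = data;

    /* Safe here: we are in an ordinary dispatch, not inside another
     * source's destroy path, so g_source_remove() can take the context
     * lock without deadlocking against ourselves. */
    if (client->callback_source_id != 0) {
        g_source_remove(client->callback_source_id);
        client->callback_source_id = 0;
    }
    g_free(client);
    return FALSE;               /* one-shot idle callback */
}

/* Called from the command channel's destroy notify: just schedule. */
static void
on_client_gone(client_t *client)
{
    g_idle_add(deferred_client_teardown, client);
}

That is only meant to illustrate the re-entrancy I think the backtrace is showing, not a claim about what the fix should be.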
Thanks
Dave

_______________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/