Hi,

On Tue, Nov 23, 2010 at 11:03:33PM +0000, Dave Williams wrote:
> Hi,
> I have a problem that looks similar to that reported in "possible deadlock
> in lrmd" on 21st Oct.
>
> When running lrmadmin -C to list classes, the first time it comes back
> immediately with the expected list, e.g.
>
> r...@node1:/home# lrmadmin -C
> There are 5 RA classes supported:
> lsb
> ocf
> stonith
> upstart
> heartbeat
>
> On all subsequent attempts lrmadmin hangs and never comes back (you have
> to kill it with Ctrl-C). This is repeatable on all the machines I have
> tried it on.
I'm afraid that this was to be expected.

> A reboot appears to be the only cure, as "corosync stop" baulks on:
> Waiting for corosync services to unload:.........
>
> The environment is compiled from source (as per
> http://www.clusterlabs.org/wiki/Install#From_Source) on a clean maverick
> server (both 32 and 64 bit).
>
> Versions are:
> Cluster-Resource-Agents-051972b5cfd
> Pacemaker-1-0-b2e39d318fda
> Reusable-Cluster-Components-8658bcdd4511
> flatiron - not sure, but downloaded Friday 19th
>
> Strace suggests that lrmadmin is stuck on
> /var/run/heartbeat/lrm_cmd_sock, reporting "resource temporarily
> unavailable", but it never gets a response to the outbound message:
>
> 17:43:41.328500 connect(3, {sa_family=AF_FILE, path="/var/run/heartbeat/lrm_cmd_sock"}, 110) = 0
> 17:43:41.328572 getsockopt(3, SOL_SOCKET, SO_PEERCRED, "\t\4\0\0\0\0\0\0\0\0\0\0", [12]) = 0
> 17:43:41.328788 getegid() = 0
> 17:43:41.328846 getuid() = 0
> 17:43:41.328970 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> 17:43:41.329050 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
> 17:43:41.329154 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> 17:43:41.329202 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
> 17:43:41.329263 sendto(3, "F\0\0\0\315\253\0\0>>>\nlrm_t=reg\nlrm_app=lr"..., 78, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 78
> 17:43:41.329337 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> 17:43:41.329380 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
> 17:43:41.329420 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> 17:43:41.329458 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
> 17:43:41.329497 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> 17:43:41.329535 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
> 17:43:41.329574 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> 17:43:41.329613 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
> 17:43:41.329651 poll([{fd=3, events=POLLIN}], 1, -1 <unfinished ...>
>
> The lrmd process is still alive and there is nothing logged in
> /var/log/daemon.log. Its strace implies it never even saw the request on
> the socket. The process still has 3 file handles open on the socket:
>
> r...@node1:~# lsof /var/run/heartbeat/lrm_cmd_sock
> COMMAND PID  USER FD  TYPE DEVICE             SIZE/OFF NODE  NAME
> lrmd    1420 root 3u  unix 0xffff88001e011040 0t0      8732  /var/run/heartbeat/lrm_cmd_sock
> lrmd    1420 root 9u  unix 0xffff88001e0b4d00 0t0      8782  /var/run/heartbeat/lrm_cmd_sock
> lrmd    1420 root 11u unix 0xffff88001e1a9d40 0t0      10211 /var/run/heartbeat/lrm_cmd_sock
>
> A good strace (i.e. lrmadmin -C after a reboot) starts identically to the
> strace above but receives a response from lrmd:
> ...
> 20:12:48.774239 poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
> 20:12:48.774603 recvfrom(3, " \0\0\0\315\253\0\0>>>\nlrm_t=return\nlrm_ret"..., 4048, MSG_DONTWAIT, NULL, NULL) = 40
> 20:12:48.774661 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
> 20:12:48.774709 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> 20:12:48.774756 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
> 20:12:48.774804 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> 20:12:48.774851 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
> 20:12:48.774898 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> 20:12:48.774945 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
> 20:12:48.775161 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> 20:12:48.775210 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
> 20:12:48.775257 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> 20:12:48.775304 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
> 20:12:48.775444 socket(PF_FILE, SOCK_STREAM, 0) = 4
> 20:12:48.775610 fcntl(4, F_GETFL) = 0x2 (flags O_RDWR)
> 20:12:48.775686 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> 20:12:48.775841 connect(4, {sa_family=AF_FILE, path="/var/run/heartbeat/lrm_callback_sock"}, 110) = 0
> 20:12:48.775907 getsockopt(4, SOL_SOCKET, SO_PEERCRED, "\214\5\0\0\0\0\0\0\0\0\0\0", [12]) = 0
> ...
>
> Other commands like "crm configure verify" exhibit the same "hang",
> although I have not traced these. I guess they must use lrmd too.

You can throw a SIGUSR1 at lrmd to print some debug.

> This is what was shown just after I executed the first lrmadmin -C:
>
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done: BEGIN resource dump
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done: resource vsftpd/vsftpd/upstart/<null>
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done: rsc->op_list...
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->op_list: lrmd_op: operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monitor] cr
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->op_list: lrmd_op2: rt_tag: 0, interval: 5000, delay: 0
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->op_list: lrmd_op3: t_recv: 210940ms, t_add: 20ms, t_perform: 20ms, t_done: 0ms, t_rcchange: 210940ms
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done: rsc->repeat_op_list...
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done: rsc->last_op_done...
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_done: lrmd_op: operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monito
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_done: lrmd_op2: rt_tag: 0, interval: 5000, delay: 0
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_done: lrmd_op3: t_recv: 210940ms, t_add: 5040ms, t_perform: 5040ms, t_done: 5020ms, t_rcchange: 210940ms
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op: operation monitor[5] on upstart::vsftpd::vsftpd for client 1041, its parameters: crm_feature_set=[3.0.
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op2: rt_tag: 0, interval: 0, delay: 0
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op3: t_recv: 214510ms, t_add: 214510ms, t_perform: 214510ms, t_done: 214470ms, t_rcchange: 214510ms
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op: operation stop[8] on upstart::vsftpd::vsftpd for client 1041, its parameters: crm_feature_set=[3.0.1]
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op2: rt_tag: 0, interval: 0, delay: 0
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op3: t_recv: 213210ms, t_add: 213210ms, t_perform: 213210ms, t_done: 213180ms, t_rcchange: 213210ms
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op: operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monit
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op2: rt_tag: 0, interval: 5000, delay: 0
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op3: t_recv: 210940ms, t_add: 5040ms, t_perform: 5040ms, t_done: 5020ms, t_rcchange: 210940ms
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op: operation start[19] on upstart::vsftpd::vsftpd for client 1041, its parameters: crm_feature_set=[3.0.1
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op2: rt_tag: 0, interval: 0, delay: 0
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op3: t_recv: 210990ms, t_add: 210990ms, t_perform: 210990ms, t_done: 210960ms, t_rcchange: 210990ms
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done: END resource dump
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done:operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monitor] crm_feature_se
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: lrmd_op_destroy: free the operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monitor]
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: replace_last_op: replace last op monitor_5000 for client crmd
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: lrmd_op_destroy: free the operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monitor]
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: to_repeatlist: repeat operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monitor] crm
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: lrmd_op_destroy: free the operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monitor]
> Nov 22 21:12:39 node1 lrmd: [1038]: debug: perform_op: no op to perform?
> Nov 22 21:12:42 node1 lrmd: [1038]: debug: G_main_IPC_Channel_constructor(sock=11,11)
> Nov 22 21:12:42 node1 lrmd: [1038]: debug: dumping request: >>>#012lrm_t=reg#012lrm_app=lrmadmin#012lrm_pid=2438#012lrm_gid=0#012lrm_uid=0#012<<<
> Nov 22 21:12:42 node1 lrmd: [1038]: debug: on_msg_register:client lrmadmin [2438] registered
> Nov 22 21:12:42 node1 lrmd: [1038]: debug: G_main_IPC_Channel_constructor(sock=12,12)
> Nov 22 21:12:42 node1 lrmd: [1038]: debug: dumping request: >>>#012lrm_t=rclasses#012<<<
> Nov 22 21:12:42 node1 lrmd: [1038]: debug: on_msg_get_rsc_classes:client [2438] wants to get rsc classes
> Nov 22 21:12:42 node1 lrmd: [1038]: debug: on_receive_cmd: the IPC to client [pid:2438] disconnected.
> Nov 22 21:12:42 node1 lrmd: [1038]: debug: G_CH_destroy_int(chp=0x25e5830, sock=11) {
> Nov 22 21:12:42 node1 lrmd: [1038]: debug: G_CH_destroy_int: Calling dnotify(sock=11, arg=0x25ea310) function
> Nov 22 21:12:42 node1 lrmd: [1038]: debug: G_main_del_IPC_Channel(sock=11)
>
> lrmd doesn't respond to subsequent signal requests, so I cannot get any
> further debug using this technique.
>
> A backtrace from gdb shows lrmd to be in a lock wait:
>
> #0  0x00007f7e5f8ba6b4 in __lll_lock_wait () from /lib/libpthread.so.0
> #1  0x00007f7e5f8b5849 in _L_lock_953 () from /lib/libpthread.so.0
> #2  0x00007f7e5f8b566b in pthread_mutex_lock () from /lib/libpthread.so.0
> #3  0x00007f7e601b0806 in g_main_context_find_source_by_id () from /lib/libglib-2.0.so.0
> #4  0x00007f7e601b08fe in g_source_remove () from /lib/libglib-2.0.so.0
> #5  0x00007f7e61568ba1 in G_main_del_IPC_Channel (chp=0x11deed0) at GSource.c:495
> #6  0x00000000004065a1 in on_remove_client (user_data=0x11df8e0) at lrmd.c:1526
> #7  0x00007f7e615694ca in G_CH_destroy_int (source=0x11deed0) at GSource.c:675
> #8  0x00007f7e601adc11 in ?? () from /lib/libglib-2.0.so.0
> #9  0x00007f7e601ae428 in g_main_context_dispatch () from /lib/libglib-2.0.so.0
> #10 0x00007f7e601b22a8 in ?? () from /lib/libglib-2.0.so.0
> #11 0x00007f7e601b27b5 in g_main_loop_run () from /lib/libglib-2.0.so.0
> #12 0x0000000000405d32 in init_start () at lrmd.c:1267
> #13 0x0000000000404f7a in main (argc=1, argv=0x7fff91e24478) at lrmd.c:835
>
> Looking at the source code, it appears that when the lrmadmin client
> disconnects from lrmd's cmd socket (having got what it needs), lrmd is
> left to tidy up by deleting the client's event source from the GMainContext
> GLib loop. It is in the process of calling g_source_remove(), which then
> hangs deep inside GLib on a mutex lock.
>
> On the surface the overall sequence makes sense, but the hang doesn't,
> and clearly shouldn't, happen. I am at a loss as to whether it is a GLib
> issue (unlikely, I would have thought?) or an lrmd bug.

It's neither. It's bad usage of glib.

> IMHO lrmd should NEVER hang!

If you don't use upstart, it won't hang.

Thanks,

Dejan

> Thanks
> Dave

_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
