In an (increasingly desperate) attempt to get a stack that works with upstart on Ubuntu, I have recompiled from source (as per http://www.clusterlabs.org/wiki/Install#From_Source) on a clean Maverick 64-bit server.

When running lrmadmin -C to list classes, the first time it comes back immediately with the expected list:

  r...@node1:/home# lrmadmin -C
  There are 5 RA classes supported:
  lsb
  ocf
  stonith
  upstart
  heartbeat

All subsequent attempts hang and never come back (you have to kill them with Ctrl-C). This is repeatable on all the machines I have tried it on.

A reboot appears to be the only cure, as corosync stop baulks on

  Waiting for corosync services to unload:.........

Is this a related fault or something different? I have seen it before on other builds and seen posts that appear to report it.

Anyway, strace suggests that lrmadmin is stuck on /var/run/heartbeat/lrm_cmd_sock: recvfrom keeps reporting "Resource temporarily unavailable" and lrmd never responds to the outbound message:

  17:43:41.328500 connect(3, {sa_family=AF_FILE, path="/var/run/heartbeat/lrm_cmd_sock"}, 110) = 0
  17:43:41.328572 getsockopt(3, SOL_SOCKET, SO_PEERCRED, "\t\4\0\0\0\0\0\0\0\0\0\0", [12]) = 0
  17:43:41.328788 getegid() = 0
  17:43:41.328846 getuid() = 0
  17:43:41.328970 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
  17:43:41.329050 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
  17:43:41.329154 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
  17:43:41.329202 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
  17:43:41.329263 sendto(3, "F\0\0\0\315\253\0\0>>>\nlrm_t=reg\nlrm_app=lr"..., 78, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 78
  17:43:41.329337 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
  17:43:41.329380 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
  17:43:41.329420 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
  17:43:41.329458 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
  17:43:41.329497 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
  17:43:41.329535 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
  17:43:41.329574 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
  17:43:41.329613 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
  17:43:41.329651 poll([{fd=3, events=POLLIN}], 1, -1 <unfinished ...>

The lrmd process is still alive and there is nothing logged in /var/log/daemon.log. Its strace implies it never even saw the request on the socket. The process still has 3 file handles open on it:

  r...@node1:~# lsof /var/run/heartbeat/lrm_cmd_sock
  COMMAND  PID USER  FD  TYPE             DEVICE SIZE/OFF  NODE NAME
  lrmd    1420 root  3u  unix 0xffff88001e011040      0t0  8732 /var/run/heartbeat/lrm_cmd_sock
  lrmd    1420 root  9u  unix 0xffff88001e0b4d00      0t0  8782 /var/run/heartbeat/lrm_cmd_sock
  lrmd    1420 root 11u  unix 0xffff88001e1a9d40      0t0 10211 /var/run/heartbeat/lrm_cmd_sock

A good strace (i.e. lrmadmin -C after a reboot) starts identically to the strace above but receives a response from lrmd:

  ...
  20:12:48.774239 poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
  20:12:48.774603 recvfrom(3, " \0\0\0\315\253\0\0>>>\nlrm_t=return\nlrm_ret"..., 4048, MSG_DONTWAIT, NULL, NULL) = 40
  20:12:48.774661 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
  20:12:48.774709 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
  20:12:48.774756 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
  20:12:48.774804 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
  20:12:48.774851 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
  20:12:48.774898 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
  20:12:48.774945 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
  20:12:48.775161 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
  20:12:48.775210 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
  20:12:48.775257 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
  20:12:48.775304 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
  20:12:48.775444 socket(PF_FILE, SOCK_STREAM, 0) = 4
  20:12:48.775610 fcntl(4, F_GETFL) = 0x2 (flags O_RDWR)
  20:12:48.775686 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
  20:12:48.775841 connect(4, {sa_family=AF_FILE, path="/var/run/heartbeat/lrm_callback_sock"}, 110) = 0
  20:12:48.775907 getsockopt(4, SOL_SOCKET, SO_PEERCRED, "\214\5\0\0\0\0\0\0\0\0\0\0", [12]) = 0
  ...

Other commands like "crm configure verify" exhibit the same hang, although I have not traced these. I guess they must use lrmd too.

I haven't tried recompiling without upstart support, as I specifically need that, but I have a suspicion it might be related. Maybe it has something to do with dbus, although a "good" command seems to complete without obvious error.

Versions are:

  Cluster-Resource-Agents-051972b5cfd
  Pacemaker-1-0-b2e39d318fda
  Reusable-Cluster-Components-8658bcdd4511
  flatiron - not sure but downloaded Friday 19th

Has anybody seen this behaviour, or does anyone know how best for me to debug further?

Thanks
Dave

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker