On 11/04/2013, at 6:05 AM, Jimmy Magee <[email protected]> wrote:
> Hi,
>
> Following up on the above thread, any thoughts as to what may be causing
> the issue..

One of the main reasons pacemakerd was created was to avoid weirdness around
the starting of pacemaker's child processes from within a multi-threaded
application like corosync... which is almost certainly what you're bumping
into here.

Could you try using "ver: 1" in corosync.conf and "service pacemaker start"
to rule out any other causes?
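For example, something like this in corosync.conf (the same stanza you
quoted below, with "ver" bumped so the plugin no longer forks the daemons
itself -- just a sketch, double-check against your installed init scripts):

   service {
        # Load the Pacemaker plugin, but leave starting/stopping the
        # pacemaker daemons to the pacemaker init script
        name: pacemaker
        ver:  1
   }

and then bring the stack up in two steps:

   service corosync start
   service pacemaker start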
> Cheers,
> Jimmy.
>
> On 9 Apr 2013, at 13:39, Jimmy Magee <[email protected]> wrote:
>
>> Hi Andrew,
>>
>> The corosync.conf is configured as follows:
>>
>>> service {
>>>     # Load the Pacemaker Cluster Resource Manager
>>>     name: pacemaker
>>>     ver:  0
>>> }
>>
>> and pacemaker is not started via service pacemaker start…
>>
>> Here is an extract from the logs, with extra debug, from attempting to
>> start corosync/pacemaker..
>>
>> 06:59:20 corosync [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
>> 06:59:20 corosync [MAIN ] Corosync built-in features: nss dbus rdma snmp
>> 06:59:20 corosync [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
>> 06:59:20 corosync [TOTEM ] waiting_trans_ack changed to 1
>> 06:59:20 corosync [TOTEM ] Token Timeout (5000 ms) retransmit timeout (247 ms)
>> 06:59:20 corosync [TOTEM ] token hold (187 ms) retransmits before loss (20 retrans)
>> 06:59:20 corosync [TOTEM ] join (1000 ms) send_join (0 ms) consensus (7500 ms) merge (200 ms)
>> 06:59:20 corosync [TOTEM ] downcheck (1000 ms) fail to recv const (2500 msgs)
>> 06:59:20 corosync [TOTEM ] seqno unchanged const (30 rotations) Maximum network MTU 1402
>> 06:59:20 corosync [TOTEM ] window size per rotation (50 messages) maximum messages per rotation (20 messages)
>> 06:59:20 corosync [TOTEM ] missed count const (5 messages)
>> 06:59:20 corosync [TOTEM ] send threads (0 threads)
>> 06:59:20 corosync [TOTEM ] RRP token expired timeout (247 ms)
>> 06:59:20 corosync [TOTEM ] RRP token problem counter (2000 ms)
>> 06:59:20 corosync [TOTEM ] RRP threshold (10 problem count)
>> 06:59:20 corosync [TOTEM ] RRP multicast threshold (100 problem count)
>> 06:59:20 corosync [TOTEM ] RRP automatic recovery check timeout (1000 ms)
>> 06:59:20 corosync [TOTEM ] RRP mode set to none.
>> 06:59:20 corosync [TOTEM ] heartbeat_failures_allowed (0)
>> 06:59:20 corosync [TOTEM ] max_network_delay (50 ms)
>> 06:59:20 corosync [TOTEM ] HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0
>> 06:59:20 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
>> 06:59:20 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
>> 06:59:20 corosync [IPC ] you are using ipc api v2
>> 06:59:20 corosync [TOTEM ] Receive multicast socket recv buffer size (320000 bytes).
>> 06:59:20 corosync [TOTEM ] Transmit multicast socket send buffer size (320000 bytes).
>> 06:59:20 corosync [TOTEM ] Local receive multicast loop socket recv buffer size (320000 bytes).
>> 06:59:20 corosync [TOTEM ] Local transmit multicast loop socket send buffer size (320000 bytes).
>> 06:59:20 corosync [TOTEM ] The network interface [10.87.79.59] is now up.
>> 06:59:20 corosync [TOTEM ] Created or loaded sequence id 6984.10.87.79.59 for this ring.
>> Set r/w permissions for uid=0, gid=0 on /var/log/corosync.log
>> 06:59:20 corosync [pcmk ] Logging: Initialized pcmk_startup
>> Set r/w permissions for uid=0, gid=0 on /var/log/corosync.log
>> 06:59:20 corosync [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.1.6
>> 06:59:20 corosync [pcmk ] Logging: Initialized pcmk_startup
>> 06:59:20 corosync [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.1.6
>> 06:59:20 corosync [SERV ] Service engine loaded: corosync extended virtual synchrony service
>> 06:59:20 corosync [SERV ] Service engine loaded: corosync configuration service
>> 06:59:20 corosync [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
>> 06:59:20 corosync [SERV ] Service engine loaded: corosync cluster config database access v1.01
>> 06:59:20 corosync [SERV ] Service engine loaded: corosync profile loading service
>> 06:59:20 corosync [SERV ] Service engine loaded: corosync cluster quorum service v0.1
>> 06:59:20 corosync [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
>> 06:59:20 corosync [TOTEM ] entering GATHER state from 15.
>> 06:59:20 corosync [TOTEM ] Creating commit token because I am the rep.
>> 06:59:20 corosync [TOTEM ] Saving state aru 0 high seq received 0
>> 06:59:20 corosync [TOTEM ] Storing new sequence id for ring 1b4c
>> 06:59:20 corosync [TOTEM ] entering COMMIT state.
>> 06:59:20 corosync [TOTEM ] got commit token
>> 06:59:20 corosync [TOTEM ] entering RECOVERY state.
>> 06:59:20 corosync [TOTEM ] position [0] member 10.87.79.59:
>> 06:59:20 corosync [TOTEM ] previous ring seq 6984 rep 10.87.79.59
>> 06:59:20 corosync [TOTEM ] aru 0 high delivered 0 received flag 1
>> 06:59:20 corosync [TOTEM ] Did not need to originate any messages in recovery.
>> 06:59:20 corosync [TOTEM ] got commit token
>> 06:59:20 corosync [TOTEM ] Sending initial ORF token
>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0
>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> 06:59:20 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
>> 06:59:20 corosync [TOTEM ] Resetting old ring state
>> 06:59:20 corosync [TOTEM ] recovery to regular 1-0
>> 06:59:20 corosync [TOTEM ] waiting_trans_ack changed to 1
>> 06:59:20 corosync [SYNC ] This node is within the primary component and will provide service.
>> 06:59:20 corosync [TOTEM ] entering OPERATIONAL state.
>> 06:59:20 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> 06:59:20 corosync [SYNC ] confchg entries 1
>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 = 1.
>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy CLM service)
>> 06:59:20 corosync [SYNC ] confchg entries 1
>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 = 1.
>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy CLM service)
>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy AMF service)
>> 06:59:20 corosync [SYNC ] confchg entries 1
>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 = 1.
>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy AMF service)
>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy CKPT service)
>> 06:59:20 corosync [SYNC ] confchg entries 1
>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 = 1.
>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy CKPT service)
>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy EVT service)
>> 06:59:20 corosync [SYNC ] confchg entries 1
>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 = 1.
>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy EVT service)
>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (corosync cluster closed process group service v1.01)
>> 06:59:20 corosync [CPG ] comparing: sender r(0) ip(10.87.79.59) ; members(old:0 left:0)
>> 06:59:20 corosync [CPG ] chosen downlist: sender r(0) ip(10.87.79.59) ; members(old:0 left:0)
>> 06:59:20 corosync [SYNC ] confchg entries 1
>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 = 1.
>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>> 06:59:20 corosync [SYNC ] Committing synchronization for (corosync cluster closed process group service v1.01)
>> 06:59:20 corosync [MAIN ] Completed service synchronization, ready to provide service.
>> 06:59:20 corosync [TOTEM ] waiting_trans_ack changed to 0
>> 06:59:20 node03 lrmd: [14934]: info: G_main_add_SignalHandler: Added signal handler for signal 15
>> 06:59:20 node03 lrmd: [14934]: info: G_main_add_SignalHandler: Added signal handler for signal 17
>> 06:59:20 node03 lrmd: [14934]: info: enabling coredumps
>> 06:59:20 node03 lrmd: [14934]: info: G_main_add_SignalHandler: Added signal handler for signal 10
>> 06:59:20 node03 lrmd: [14934]: info: G_main_add_SignalHandler: Added signal handler for signal 12
>> 06:59:20 node03 lrmd: [14934]: debug: main: run the loop...
>> 06:59:20 node03 lrmd: [14934]: info: Started.
>> 06:59:20 [14935] node03 attrd: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>> 06:59:20 [14935] node03 attrd: info: main: Starting up
>> 06:59:20 [14935] node03 attrd: info: get_cluster_type: Cluster type is: 'openais'
>> 06:59:20 [14935] node03 attrd: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
>> 06:59:20 [14936] node03 pengine: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>> 06:59:20 [14935] node03 attrd: info: init_ais_connection_classic: Creating connection to our Corosync plugin
>> 06:59:20 [14936] node03 pengine: debug: main: Checking for old instances of pengine
>> 06:59:20 [14937] node03 crmd: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>> 06:59:20 [14936] node03 pengine: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine
>> 06:59:20 [14937] node03 crmd: notice: main: CRM Hg Version: 148fccfd5985c5590cc601123c6c16e966b85d14
>> 06:59:20 [14936] node03 pengine: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/pengine
>> 06:59:20 [14936] node03 pengine: debug: main: Init server comms
>> 06:59:20 [14936] node03 pengine: info: main: Starting pengine
>> 06:59:20 [14937] node03 crmd: debug: crmd_init: Starting crmd
>> 06:59:20 [14937] node03 crmd: debug: s_crmd_fsa: Processing I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
>> 06:59:20 [14937] node03 crmd: debug: do_fsa_action: actions:trace: // A_LOG
>> 06:59:20 [14937] node03 crmd: debug: do_log: FSA: Input I_STARTUP from crmd_init() received in state S_STARTING
>> 06:59:20 [14937] node03 crmd: debug: do_fsa_action: actions:trace: // A_STARTUP
>> 06:59:20 [14937] node03 crmd: debug: do_startup: Registering Signal Handlers
>> 06:59:20 [14937] node03 crmd: debug: do_startup: Creating CIB and LRM objects
>> 06:59:20 [14937] node03 crmd: debug: do_fsa_action: actions:trace: // A_CIB_START
>> 06:59:20 [14937] node03 crmd: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_rw
>> 06:59:20 [14937] node03 crmd: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_rw
>> 06:59:20 [14937] node03 crmd: debug: cib_native_signon_raw: Connection to command channel failed
>> 06:59:20 [14937] node03 crmd: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_callback
>> 06:59:20 [14937] node03 crmd: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_callback
>> 06:59:20 [14937] node03 crmd: debug: cib_native_signon_raw: Connection to callback channel failed
>> 06:59:20 [14937] node03 crmd: debug: cib_native_signon_raw: Connection to CIB failed: connection failed
>> 06:59:20 [14937] node03 crmd: debug: cib_native_signoff: Signing out of the CIB Service
>> 06:59:20 [14935] node03 attrd: debug: init_ais_connection_classic: Adding fd=6 to mainloop
>> 06:59:20 [14935] node03 attrd: info: init_ais_connection_classic: AIS connection established
>> 06:59:20 [14935] node03 attrd: info: get_ais_nodeid: Server details: id=1003428268 uname=node03 cname=pcmk
>> 06:59:20 [14935] node03 attrd: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
>> 06:59:20 [14935] node03 attrd: debug: crm_new_peer: Creating entry for node node03/1003428268
>> 06:59:20 [14935] node03 attrd: info: crm_new_peer: Node node03 now has id: 1003428268
>> 06:59:20 [14935] node03 attrd: info: crm_new_peer: Node 1003428268 is now known as node03
>> 06:59:20 [14935] node03 attrd: info: main: Cluster connection active
>> 06:59:20 [14935] node03 attrd: info: main: Accepting attribute updates
>> 06:59:20 [14935] node03 attrd: notice: main: Starting mainloop...
>> 06:59:20 [14933] node03 stonith-ng: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
>> 06:59:20 [14933] node03 stonith-ng: info: get_cluster_type: Cluster type is: 'openais'
>> 06:59:20 [14933] node03 stonith-ng: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
>> 06:59:20 [14933] node03 stonith-ng: info: init_ais_connection_classic: Creating connection to our Corosync plugin
>> 06:59:20 [14932] node03 cib: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>> 06:59:20 [14932] node03 cib: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk] <cib epoch="251" num_updates="0" admin_epoch="1" validate-with="pacemaker-1.2" crm_feature_set="3.0.6" update-origin="node03" update-client="crmd" cib-last-written="Tue Apr  9 06:48:33 2013" have-quorum="1" >
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]   <configuration >
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]     <crm_config >
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]       <cluster_property_set id="cib-bootstrap-options" >
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]         <nvpair id="cib-bootstrap-options-default-resource-stickiness" name="default-resource-stickiness" value="1000" />
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]         <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore" />
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]         <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false" />
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]         <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="3" />
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" />
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]         <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="openais" />
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]         <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1365160119" />
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]       </cluster_property_set>
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]     </crm_config>
>> …
>> …
>> ...
>>
>> We are still seeing the extra pacemaker daemons when corosync starts up.
>> As an added check, all pacemaker daemons exited correctly when stopping
>> corosync. lrmd attempts to start twice..
>>
>> ps aux | grep lrmd
>> root     16412  0.0  0.0      0     0 ?        Z    07:20   0:00 [lrmd] <defunct>
>> root     16419  0.0  0.0  34240  1052 ?        S    07:20   0:00 /usr/lib64/heartbeat/lrmd
>> root     21030  0.0  0.0 103244   856 pts/0    S+   08:37   0:00 grep lrmd
>>
>> Help to resolve this issue appreciated..
>>
>> Cheers,
>> Jimmy.
>>
>> On 9 Apr 2013, at 00:16, Andrew Beekhof <[email protected]> wrote:
>>
>>> On 08/04/2013, at 9:44 PM, Jimmy Magee <[email protected]> wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>> thanks for your reply, we are running at debug level with the following
>>>> config from corosync.conf
>>>>
>>>> logging {
>>>>     fileline: off
>>>>     to_syslog: yes
>>>>     to_stderr: no
>>>>     syslog_facility: daemon
>>>>     debug: on
>>>>     timestamp: on
>>>> }
>>>>
>>>> Looking at the issue further, there seem to be 2 instances of some
>>>> pacemaker daemons running on this particular node….
>>>>
>>>> ps aux | grep pace
>>>>
>>>> 495       3050  0.2  0.0  89956  7184 ?        S    07:10   0:01 /usr/libexec/pacemaker/cib
>>>> root      3051  0.0  0.0  87128  3152 ?        S    07:10   0:00 /usr/libexec/pacemaker/stonithd
>>>> 495       3053  0.0  0.0  91188  2840 ?        S    07:10   0:00 /usr/libexec/pacemaker/attrd
>>>> 495       3054  0.0  0.0  87336  2484 ?        S    07:10   0:00 /usr/libexec/pacemaker/pengine
>>>> 495       3055  0.0  0.0  91332  3156 ?        S    07:10   0:00 /usr/libexec/pacemaker/crmd
>>>> 495       3057  0.0  0.0  88876  5224 ?        S    07:10   0:00 /usr/libexec/pacemaker/cib
>>>> root      3058  0.0  0.0  87128  3132 ?        S    07:10   0:00 /usr/libexec/pacemaker/stonithd
>>>> 495       3060  0.0  0.0  91188  2788 ?        S    07:10   0:00 /usr/libexec/pacemaker/attrd
>>>> 495       3062  0.0  0.0  91436  3932 ?        S    07:10   0:00 /usr/libexec/pacemaker/crmd
>>>>
>>>> ps aux | grep corosync
>>>> root      3044  0.1  0.0 977852  9264 ?        Ssl  07:10   0:01 corosync
>>>> root      9363  0.0  0.0 103248   856 pts/0    S+   07:33   0:00 grep corosync
>>>>
>>>> ps aux | grep lrmd
>>>> root      3052  0.0  0.0  76464  2528 ?        S    07:10   0:00 /usr/lib64/heartbeat/lrmd
>>>>
>>>> Not sure why this is the case? Appreciate any help..
>>>>
>>> Have you perhaps specified "ver: 0" for the pacemaker plugin and run
>>> "service pacemaker start" ?
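>>>
>>> A quick way to tell which of the two start-up models is actually in
>>> effect (just a sketch, assuming the stock el6 packages):
>>>
>>>     # what does the plugin stanza say?
>>>     grep -A3 'service {' /etc/corosync/corosync.conf
>>>
>>>     # is the pacemaker init script enabled as well?
>>>     chkconfig --list pacemaker
>>>
>>>     # pacemakerd is only present when the init script started the daemons
>>>     ps -ef | grep pacemaker[d]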
>>>> Cheers,
>>>> Jimmy.
>>>>
>>>> On 8 Apr 2013, at 03:00, Andrew Beekhof <[email protected]> wrote:
>>>>
>>>>> This doesn't look promising:
>>>>>
>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 15
>>>>> lrmd: [4946]: info: Signal sent to pid=4939, waiting for process to exit
>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 17
>>>>> lrmd: [4939]: info: enabling coredumps
>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 10
>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 12
>>>>> lrmd: [4939]: info: Started.
>>>>> lrmd: [4939]: info: lrmd is shutting down
>>>>>
>>>>> The lrmd comes up but then immediately shuts down.
>>>>> Perhaps try enabling debug to see if that sheds any light.
>>>>>
>>>>> On 06/04/2013, at 4:58 AM, Jimmy Magee <[email protected]> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> Apologies for reposting this query, it inadvertently got added to an
>>>>>> existing topic!
>>>>>>
>>>>>> We have a three node cluster deployed in a customer's network:
>>>>>> - 2 nodes are on the same switch
>>>>>> - the 3rd node is on the same subnet but there's a router in between.
>>>>>> - IP Multicast is enabled and has been tested using omping as follows..
>>>>>>
>>>>>> On each node we ran..
>>>>>>
>>>>>> omping node01 node02 node03
>>>>>>
>>>>>> On node 3
>>>>>>
>>>>>> Node01 : unicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev = 0.128/0.181/0.255/0.025
>>>>>> Node01 : multicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev = 0.140/0.187/0.219/0.021
>>>>>> Node02 : unicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev = 0.115/0.150/0.168/0.021
>>>>>> Node02 : multicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev = 0.134/0.162/0.177/0.014
>>>>>>
>>>>>> On node 2
>>>>>>
>>>>>> Node01 : unicast, xmt/rcv/%loss = 9/9/0%, min/avg/max/std-dev = 0.168/0.191/0.205/0.014
>>>>>> Node01 : multicast, xmt/rcv/%loss = 9/8/11% (seq>=2 0%), min/avg/max/std-dev = 0.138/0.179/0.206/0.028
>>>>>> Node03 : unicast, xmt/rcv/%loss = 9/9/0%, min/avg/max/std-dev = 0.112/0.149/0.175/0.022
>>>>>> Node03 : multicast, xmt/rcv/%loss = 9/8/11% (seq>=2 0%), min/avg/max/std-dev = 0.124/0.167/0.178/0.018
>>>>>>
>>>>>> On node 1
>>>>>>
>>>>>> Node02 : unicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev = 0.154/0.185/0.208/0.019
>>>>>> Node02 : multicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev = 0.175/0.198/0.214/0.015
>>>>>> Node03 : unicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev = 0.114/0.160/0.185/0.019
>>>>>> Node03 : multicast, xmt/rcv/%loss = 23/22/4% (seq>=2 0%), min/avg/max/std-dev = 0.124/0.172/0.197/0.019
>>>>>>
>>>>>> - The problem is intermittent but frequent; occasionally everything
>>>>>>   starts fine from scratch.
>>>>>>
>>>>>> We suspect the problem is related to node 3 as we can see lrmd failures
>>>>>> as per the attached log. We've checked that permissions are ok as per
>>>>>> https://bugs.launchpad.net/ubuntu/+source/cluster-glue/+bug/676391
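>>>>>>
>>>>>> e.g. (illustrative only -- these are the "cores" directories the
>>>>>> daemons chdir into per the logs below):
>>>>>>
>>>>>> ls -ld /var/lib/heartbeat /var/lib/heartbeat/cores /var/lib/heartbeat/cores/root /var/lib/heartbeat/cores/hacluster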
>>>>>>
>>>>>> stonith-ng[1437]: error: ais_dispatch: AIS connection failed
>>>>>> stonith-ng[1437]: error: stonith_peer_ais_destroy: AIS connection terminated
>>>>>> corosync[1430]: [SERV ] Service engine unloaded: Pacemaker Cluster Manager 1.1.6
>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync extended virtual synchrony service
>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync configuration service
>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync cluster config database access v1.01
>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync profile loading service
>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
>>>>>> corosync[1430]: [MAIN ] Corosync Cluster Engine exiting with status 0 at main.c:1894.
>>>>>>
>>>>>> corosync[4931]: [MAIN ] Corosync built-in features: nss dbus rdma snmp
>>>>>> corosync[4931]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
>>>>>> corosync[4931]: [TOTEM ] Initializing transport (UDP/IP Multicast).
>>>>>> corosync[4931]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
>>>>>> corosync[4931]: [TOTEM ] The network interface [10.87.79.59] is now up.
>>>>>> corosync[4931]: [pcmk ] Logging: Initialized pcmk_startup
>>>>>> corosync[4931]: [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.1.6
>>>>>> corosync[4931]: [pcmk ] Logging: Initialized pcmk_startup
>>>>>> corosync[4931]: [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.1.6
>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync configuration service
>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync profile loading service
>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
>>>>>> corosync[4931]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
>>>>>> corosync[4931]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>>> corosync[4931]: [CPG ] chosen downlist: sender r(0) ip(10.87.79.59) ; members(old:0 left:0)
>>>>>> corosync[4931]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>>> cib[4937]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>>> cib[4937]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
>>>>>> cib[4937]: info: validate_with_relaxng: Creating RNG parser context
>>>>>> stonith-ng[4945]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
>>>>>> stonith-ng[4945]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>> stonith-ng[4945]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
>>>>>> stonith-ng[4945]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
>>>>>> cib[4944]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>>> cib[4944]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
>>>>>> stonith-ng[4945]: info: init_ais_connection_classic: AIS connection established
>>>>>> stonith-ng[4945]: info: get_ais_nodeid: Server details: id=1003428268 uname=node03 cname=pcmk
>>>>>> stonith-ng[4945]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
>>>>>> stonith-ng[4945]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>> stonith-ng[4945]: info: crm_new_peer: Node 1003428268 is now known as node03
>>>>>> cib[4944]: info: validate_with_relaxng: Creating RNG parser context
>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 15
>>>>>> lrmd: [4946]: info: Signal sent to pid=4939, waiting for process to exit
>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 17
>>>>>> lrmd: [4939]: info: enabling coredumps
>>>>>> stonith-ng[4938]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 10
>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 12
>>>>>> lrmd: [4939]: info: Started.
>>>>>> stonith-ng[4938]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>> lrmd: [4939]: info: lrmd is shutting down
>>>>>> stonith-ng[4938]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
>>>>>> stonith-ng[4938]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
>>>>>> attrd[4940]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>>> pengine[4941]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>>> attrd[4940]: info: main: Starting up
>>>>>> attrd[4940]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>> attrd[4940]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
>>>>>> attrd[4940]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
>>>>>> crmd[4942]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>>> pengine[4941]: info: main: Starting pengine
>>>>>> crmd[4942]: notice: main: CRM Hg Version: 148fccfd5985c5590cc601123c6c16e966b85d14
>>>>>> pengine[4948]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>>> pengine[4948]: warning: main: Terminating previous PE instance
>>>>>> attrd[4947]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>>> pengine[4941]: warning: process_pe_message: Received quit message, terminating
>>>>>> attrd[4947]: info: main: Starting up
>>>>>> attrd[4947]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>> attrd[4947]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
>>>>>> attrd[4947]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
>>>>>> crmd[4949]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>>> crmd[4949]: notice: main: CRM Hg Version: 148fccfd5985c5590cc601123c6c16e966b85d14
>>>>>> stonith-ng[4938]: info: init_ais_connection_classic: AIS connection established
>>>>>> stonith-ng[4938]: info: get_ais_nodeid: Server details: id=1003428268 uname=node03 cname=pcmk
>>>>>> stonith-ng[4938]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
>>>>>> stonith-ng[4938]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>> stonith-ng[4938]: info: crm_new_peer: Node 1003428268 is now known as node03
>>>>>> attrd[4940]: info: init_ais_connection_classic: AIS connection established
>>>>>> attrd[4940]: info: get_ais_nodeid: Server details: id=1003428268 uname=node03 cname=pcmk
>>>>>> attrd[4940]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
>>>>>> attrd[4940]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>> attrd[4940]: info: crm_new_peer: Node 1003428268 is now known as node03
>>>>>> attrd[4940]: info: main: Cluster connection active
>>>>>> attrd[4940]: info: main: Accepting attribute updates
>>>>>> attrd[4940]: notice: main: Starting mainloop...
>>>>>> attrd[4947]: info: init_ais_connection_classic: AIS connection established
>>>>>> attrd[4947]: info: get_ais_nodeid: Server details: id=1003428268 uname=node03 cname=pcmk
>>>>>> attrd[4947]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
>>>>>> attrd[4947]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>> attrd[4947]: info: crm_new_peer: Node 1003428268 is now known as node03
>>>>>> attrd[4947]: info: main: Cluster connection active
>>>>>> attrd[4947]: info: main: Accepting attribute updates
>>>>>> attrd[4947]: notice: main: Starting mainloop...
>>>>>> cib[4937]: info: startCib: CIB Initialization completed successfully
>>>>>> cib[4937]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>> cib[4937]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
>>>>>> cib[4937]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
>>>>>> cib[4944]: info: startCib: CIB Initialization completed successfully
>>>>>> cib[4944]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>> cib[4944]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
>>>>>> cib[4944]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
>>>>>> cib[4937]: info: init_ais_connection_classic: AIS connection established
>>>>>> cib[4937]: info: get_ais_nodeid: Server details: id=1003428268 uname=node03 cname=pcmk
>>>>>> cib[4937]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
>>>>>> cib[4937]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>> cib[4937]: info: crm_new_peer: Node 1003428268 is now known as node03
>>>>>> cib[4937]: info: cib_init: Starting cib mainloop
>>>>>> cib[4937]: info: ais_dispatch_message: Membership 6892: quorum still lost
>>>>>> cib[4937]: info: crm_update_peer: Node node03: id=1003428268 state=member (new) addr=r(0) ip(10.87.79.59) (new) votes=1 (new) born=0 seen=6892 proc=00000000000000000000000000111312 (new)
>>>>>> cib[4944]: info: init_ais_connection_classic: AIS connection established
>>>>>> cib[4944]: info: get_ais_nodeid: Server details: id=1003428268 uname=node03 cname=pcmk
>>>>>> cib[4944]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
>>>>>> cib[4944]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>> cib[4944]: info: crm_new_peer: Node 1003428268 is now known as node03
>>>>>> cib[4944]: info: cib_init: Starting cib mainloop
>>>>>> stonith-ng[4945]: notice: setup_cib: Watching for stonith topology changes
>>>>>> stonith-ng[4945]: info: main: Starting stonith-ng mainloop
>>>>>> cib[4937]: info: ais_dispatch_message: Membership 6896: quorum still lost
>>>>>> corosync[4931]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>>> cib[4937]: info: crm_new_peer: Node <null> now has id: 969873836
>>>>>> cib[4937]: info: crm_update_peer: Node (null): id=969873836 state=member (new) addr=r(0) ip(172.25.207.57) votes=0 born=0 seen=6896 proc=00000000000000000000000000000000
>>>>>> cib[4937]: info: crm_new_peer: Node <null> now has id: 986651052
>>>>>> cib[4937]: info: crm_update_peer: Node (null): id=986651052 state=member (new) addr=r(0) ip(172.25.207.58) votes=0 born=0 seen=6896 proc=00000000000000000000000000000000
>>>>>> cib[4937]: notice: ais_dispatch_message: Membership 6896: quorum acquired
>>>>>> cib[4937]: info: crm_get_peer: Node 986651052 is now known as node02
>>>>>> cib[4937]: info: crm_update_peer: Node node02: id=986651052 state=member addr=r(0) ip(172.25.207.58) votes=1 (new) born=6812 seen=6896 proc=00000000000000000000000000111312 (new)
>>>>>> cib[4937]: info: ais_dispatch_message: Membership 6896: quorum retained
>>>>>> cib[4937]: info: crm_get_peer: Node 969873836 is now known as node01
>>>>>> cib[4937]: info: crm_update_peer: Node node01: id=969873836 state=member addr=r(0) ip(172.25.207.57) votes=1 (new) born=6848 seen=6896 proc=00000000000000000000000000111312 (new)
>>>>>> rsyslogd-2177: imuxsock begins to drop messages from pid 4931 due to rate-limiting
>>>>>> crmd[4942]: info: do_cib_control: CIB connection established
>>>>>> crmd[4942]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>> crmd[4942]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
>>>>>> crmd[4942]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
>>>>>> cib[4937]: info: cib_process_diff: Diff 1.249.28 -> 1.249.29 not applied to 1.249.0: current "num_updates" is less than required
>>>>>> cib[4937]: info: cib_server_process_diff: Requesting re-sync from peer
>>>>>> crmd[4949]: info: do_cib_control: CIB connection established
>>>>>> crmd[4949]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>> crmd[4949]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
>>>>>> crmd[4949]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
>>>>>> stonith-ng[4938]: notice: setup_cib: Watching for stonith topology changes
>>>>>> stonith-ng[4938]: info: main: Starting stonith-ng mainloop
>>>>>> cib[4937]: notice: cib_server_process_diff: Not applying diff 1.249.29 -> 1.249.30 (sync in progress)
>>>>>> crmd[4942]: info: init_ais_connection_classic: AIS connection established
>>>>>> crmd[4942]: info: get_ais_nodeid: Server details: id=1003428268 uname=node03 cname=pcmk
>>>>>> crmd[4942]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
>>>>>> crmd[4942]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>> crmd[4942]: info: crm_new_peer: Node 1003428268 is now known as node03
>>>>>> crmd[4942]: info: ais_status_callback: status: node03 is now unknown
>>>>>> crmd[4942]: info: do_ha_control: Connected to the cluster
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 1 (30 max) times
>>>>>> crmd[4949]: info: init_ais_connection_classic: AIS connection established
>>>>>> crmd[4949]: info: get_ais_nodeid: Server details: id=1003428268 uname=node03 cname=pcmk
>>>>>> crmd[4949]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
>>>>>> crmd[4942]: notice: ais_dispatch_message: Membership 6896: quorum acquired
>>>>>> crmd[4949]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>> crmd[4949]: info: crm_new_peer: Node 1003428268 is now known as node03
>>>>>> crmd[4942]: info: crm_new_peer: Node node01 now has id: 969873836
>>>>>> crmd[4949]: info: ais_status_callback: status: node03 is now unknown
>>>>>> crmd[4942]: info: crm_new_peer: Node 969873836 is now known as node01
>>>>>> crmd[4949]: info: do_ha_control: Connected to the cluster
>>>>>> crmd[4942]: info: ais_status_callback: status: node01 is now unknown
>>>>>> crmd[4942]: info: ais_status_callback: status: node01 is now member (was unknown)
>>>>>> crmd[4942]: info: crm_update_peer: Node node01: id=969873836 state=member (new) addr=r(0) ip(172.25.207.57) votes=1 born=6848 seen=6896 proc=00000000000000000000000000111312
>>>>>> crmd[4942]: info: crm_new_peer: Node node02 now has id: 986651052
>>>>>> crmd[4942]: info: crm_new_peer: Node 986651052 is now known as node02
>>>>>> crmd[4942]: info: ais_status_callback: status: node02 is now unknown
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 1 (30 max) times
>>>>>> crmd[4942]: info: ais_status_callback: status: node02 is now member (was unknown)
>>>>>> crmd[4942]: info: crm_update_peer: Node node02: id=986651052 state=member (new) addr=r(0) ip(172.25.207.58) votes=1 born=6812 seen=6896 proc=00000000000000000000000000111312
>>>>>> crmd[4942]: notice: crmd_peer_update: Status update: Client node03/crmd now has status [online] (DC=<null>)
>>>>>> crmd[4942]: info: ais_status_callback: status: node03 is now member (was unknown)
>>>>>> crmd[4942]: info: crm_update_peer: Node node03: id=1003428268 state=member (new) addr=r(0) ip(10.87.79.59) (new) votes=1 (new) born=6896 seen=6896 proc=00000000000000000000000000111312 (new)
>>>>>> crmd[4942]: info: ais_dispatch_message: Membership 6896: quorum retained
>>>>>> cib[4937]: notice: cib_server_process_diff: Not applying diff 1.249.30 -> 1.249.31 (sync in progress)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 2 (30 max) times
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 3 (30 max) times
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 2 (30 max) times
>>>>>> crmd[4949]: notice: ais_dispatch_message: Membership 6896: quorum acquired
>>>>>> rsyslogd-2177: imuxsock begins to drop messages from pid 4937 due to rate-limiting
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 4 (30 max) times
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 5 (30 max) times
>>>>>> pengine[4948]: info: main: Starting pengine
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 6 (30 max) times
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 3 (30 max) times
>>>>>> attrd[4940]: info: cib_connect: Connected to the CIB after 1 signon attempts
>>>>>> attrd[4940]: info: cib_connect: Sending full refresh
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 7 (30 max) times
>>>>>> attrd[4947]: info: cib_connect: Connected to the CIB after 1 signon attempts
>>>>>> attrd[4947]: info: cib_connect: Sending full refresh
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 4 (30 max) times
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 8 (30 max) times
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 5 (30 max) times
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 9 (30 max) times
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 6 (30 max) times
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 10 (30 max) times
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 7 (30 max) times
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 11 (30 max) times
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 8 (30 max) times
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 12 (30 max) times
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 9 (30 max) times
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 13 (30 max) times
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 10 (30 max) times
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 14 (30 max) times
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 11 (30 max) times
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 12 (30 max) times
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 15 (30 max) times
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 13 (30 max) times
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 16 (30 max) times
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 14 (30 max) times
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 17 (30 max) times
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 15 (30 max) times
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 18 (30 max) times
>>>>>>
>>>>>> We have the following components installed..
>>>>>>
>>>>>> corosynclib-1.4.1-15.el6.x86_64
>>>>>> corosync-1.4.1-15.el6.x86_64
>>>>>> cluster-glue-libs-1.0.5-6.el6.x86_64
>>>>>> clusterlib-3.0.12.1-49.el6.x86_64
>>>>>> pacemaker-cluster-libs-1.1.7-6.el6.x86_64
>>>>>> cluster-glue-1.0.5-6.el6.x86_64
>>>>>> resource-agents-3.9.2-12.el6.x86_64
>>>>>>
>>>>>> We'd appreciate assistance on how to debug what the issue may be and
>>>>>> some possible causes.
>>>>>>
>>>>>> Cheers,
>>>>>> Jimmy

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
