Hi Andrew,
Thanks for your reply, we tried that option but to no avail.
To resolve the issue, what worked for us was to remove the existing HA packages
and update pacemaker to 1.1.8-7.
Here is the procedure…
1. Back up /etc/corosync/corosync.conf and /etc/corosync/authkey.
2. Export cib.xml:
cibadmin -Q > /tmp/ha_backup/cib.xml
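Steps 1 and 2 can be wrapped in a small helper so nothing gets missed before the
removal step (a sketch only; the helper name is ours, the paths and the
/tmp/ha_backup destination are the ones used above):

```shell
# Copy the corosync config and authkey into one backup directory.
# Parameterised so it can be exercised outside a cluster node.
backup_ha_config() {
    src=$1   # directory holding corosync.conf and authkey
    dst=$2   # backup destination, created if missing
    mkdir -p "$dst"
    cp "$src/corosync.conf" "$src/authkey" "$dst/"
}

# On a live node this would be:
#   backup_ha_config /etc/corosync /tmp/ha_backup
#   cibadmin -Q > /tmp/ha_backup/cib.xml
```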
3. Stop corosync services on all nodes
4. Remove existing HA packages:
yum -y remove pacemaker corosync heartbeat resource-agents \
    cluster-glue rgmanager lvm2-cluster gfs2-utils
5. Install updated HA packages:
yum -y install pacemaker cman ccs resource-agents
resulting in the following packages being installed:
pacemaker-doc-1.1.8-7.el6.x86_64
pacemaker-cli-1.1.8-7.el6.x86_64
pacemaker-libs-1.1.8-7.el6.x86_64
pacemaker-cts-1.1.8-7.el6.x86_64
pacemaker-libs-devel-1.1.8-7.el6.x86_64
pacemaker-cluster-libs-1.1.8-7.el6.x86_64
pacemaker-1.1.8-7.el6.x86_64
pacemaker-debuginfo-1.1.8-7.el6.x86_64
cman-3.0.12.1-49.el6.x86_64
ccs-0.16.2-55.el6.x86_64
resource-agents-3.9.2-12.el6.x86_64
cluster-glue-libs-1.0.5-6.el6.x86_64
corosync-1.4.1-15.el6.x86_64
corosynclib-1.4.1-15.el6.x86_64
corosync-debuginfo-1.4.1-15.el6.x86_64
corosynclib-devel-1.4.1-15.el6.x86_64
6. Get the crmsh package and install it:
yum -y install crmsh*
7. Start the ricci service:
service ricci start
Also ensure it starts on boot:
chkconfig --add ricci
8. Set the ricci password:
passwd ricci
9. Configure the cluster:
ccs -f /etc/cluster/cluster.conf --createcluster testprod -i
ccs -f /etc/cluster/cluster.conf --addnode node01
ccs -f /etc/cluster/cluster.conf --addnode node02
ccs -f /etc/cluster/cluster.conf --addnode node03
ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node01
ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node02
ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node03
ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node01 pcmk-redirect port=1
ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node02 pcmk-redirect port=2
ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node03 pcmk-redirect port=3
ccs -f /etc/cluster/cluster.conf --setlogging debug=on
ccs -f /etc/cluster/cluster.conf --settotem
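For reference, the ccs commands above should produce a cluster.conf along these
lines (a sketch reconstructed from the commands, not a verbatim copy;
config_version, ids and attribute order will differ on a real system):

```xml
<cluster name="testprod" config_version="1">
  <clusternodes>
    <clusternode name="node01" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="1"/>
        </method>
      </fence>
    </clusternode>
    <!-- node02 and node03 follow the same pattern with ports 2 and 3 -->
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_pcmk" name="pcmk"/>
  </fencedevices>
  <logging debug="on"/>
</cluster>
```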
10. Distribute the cluster.conf:
ccs -h node01 -p ************** --sync --activate
11. Set the CMAN quorum timeout to 0 on each of the three nodes:
echo "CMAN_QUORUM_TIMEOUT=0" >> /etc/sysconfig/cman
12. Start the services on each node:
service cman start
service pacemaker start
Also ensure they start on boot:
chkconfig --add cman
chkconfig --add pacemaker
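The boot-time registration in step 12 can be reviewed as a dry run first (a
sketch; the helper only prints the chkconfig commands so they can be checked,
or piped to sh to apply — note `chkconfig <svc> on` is an extra step beyond
`--add` that forces the services on in the default runlevels):

```shell
# Print the chkconfig commands for each service instead of running
# them, so the boot-time setup can be reviewed (or piped to sh).
enable_on_boot() {
    for svc in "$@"; do
        echo "chkconfig --add $svc"
        echo "chkconfig $svc on"
    done
}

enable_on_boot cman pacemaker
```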
Best of luck,
Jimmy.
On 12 Apr 2013, at 02:11, Andrew Beekhof <[email protected]> wrote:
>
> On 11/04/2013, at 6:05 AM, Jimmy Magee <[email protected]> wrote:
>
>> Hi,
>>
>> Following up on the above thread, any thoughts as to what may be causing the
>> issue..
>
> One of the main reasons pacemakerd was created was to avoid weirdness around
> the starting of pacemaker's child processes from within a multi-threaded
> application like corosync... which is almost certainly what you're bumping
> into here.
>
> Could you try using "ver: 1" in corosync.conf and "service pacemaker start"
> to rule out any other causes?
>
>>
>> Cheers,
>> Jimmy.
>>
>>
>>
>> On 9 Apr 2013, at 13:39, Jimmy Magee <[email protected]> wrote:
>>
>>> Hi Andrew,
>>>
>>> The corosync.conf is configured as follows:
>>>
>>>
>>>> service {
>>>> # Load the Pacemaker Cluster Resource Manager
>>>> name: pacemaker
>>>> ver: 0
>>>> }
>>>
>>>
>>>
>>> and pacemaker is not started via service pacemaker start…
>>>
>>> here is the extract from the logs with extra debug when attempting to start
>>> corosync/pacemaker..
>>>
>>> 06:59:20 corosync [MAIN ] Corosync Cluster Engine ('1.4.1'): started and
>>> ready to provide service.
>>> 06:59:20 corosync [MAIN ] Corosync built-in features: nss dbus rdma snmp
>>> 06:59:20 corosync [MAIN ] Successfully read main configuration file
>>> '/etc/corosync/corosync.conf'.
>>> 06:59:20 corosync [TOTEM ] waiting_trans_ack changed to 1
>>> 06:59:20 corosync [TOTEM ] Token Timeout (5000 ms) retransmit timeout (247
>>> ms)
>>> 06:59:20 corosync [TOTEM ] token hold (187 ms) retransmits before loss (20
>>> retrans)
>>> 06:59:20 corosync [TOTEM ] join (1000 ms) send_join (0 ms) consensus (7500
>>> ms) merge (200 ms)
>>> 06:59:20 corosync [TOTEM ] downcheck (1000 ms) fail to recv const (2500
>>> msgs)
>>> 06:59:20 corosync [TOTEM ] seqno unchanged const (30 rotations) Maximum
>>> network MTU 1402
>>> 06:59:20 corosync [TOTEM ] window size per rotation (50 messages) maximum
>>> messages per rotation (20 messages)
>>> 06:59:20 corosync [TOTEM ] missed count const (5 messages)
>>> 06:59:20 corosync [TOTEM ] send threads (0 threads)
>>> 06:59:20 corosync [TOTEM ] RRP token expired timeout (247 ms)
>>> 06:59:20 corosync [TOTEM ] RRP token problem counter (2000 ms)
>>> 06:59:20 corosync [TOTEM ] RRP threshold (10 problem count)
>>> 06:59:20 corosync [TOTEM ] RRP multicast threshold (100 problem count)
>>> 06:59:20 corosync [TOTEM ] RRP automatic recovery check timeout (1000 ms)
>>> 06:59:20 corosync [TOTEM ] RRP mode set to none.
>>> 06:59:20 corosync [TOTEM ] heartbeat_failures_allowed (0)
>>> 06:59:20 corosync [TOTEM ] max_network_delay (50 ms)
>>> 06:59:20 corosync [TOTEM ] HeartBeat is Disabled. To enable set
>>> heartbeat_failures_allowed > 0
>>> 06:59:20 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
>>> 06:59:20 corosync [TOTEM ] Initializing transmit/receive security:
>>> libtomcrypt SOBER128/SHA1HMAC (mode 0).
>>> 06:59:20 corosync [IPC ] you are using ipc api v2
>>> 06:59:20 corosync [TOTEM ] Receive multicast socket recv buffer size
>>> (320000 bytes).
>>> 06:59:20 corosync [TOTEM ] Transmit multicast socket send buffer size
>>> (320000 bytes).
>>> 06:59:20 corosync [TOTEM ] Local receive multicast loop socket recv buffer
>>> size (320000 bytes).
>>> 06:59:20 corosync [TOTEM ] Local transmit multicast loop socket send buffer
>>> size (320000 bytes).
>>> 06:59:20 corosync [TOTEM ] The network interface [10.87.79.59] is now up.
>>> 06:59:20 corosync [TOTEM ] Created or loaded sequence id 6984.10.87.79.59
>>> for this ring.
>>> Set r/w permissions for uid=0, gid=0 on /var/log/corosync.log
>>> 06:59:20 corosync [pcmk ] Logging: Initialized pcmk_startup
>>> Set r/w permissions for uid=0, gid=0 on /var/log/corosync.log
>>> 06:59:20 corosync [SERV ] Service engine loaded: Pacemaker Cluster Manager
>>> 1.1.6
>>> 06:59:20 corosync [pcmk ] Logging: Initialized pcmk_startup
>>> 06:59:20 corosync [SERV ] Service engine loaded: Pacemaker Cluster Manager
>>> 1.1.6
>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync extended virtual
>>> synchrony service
>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync configuration
>>> service
>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync cluster closed
>>> process group service v1.01
>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync cluster config
>>> database access v1.01
>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync profile loading
>>> service
>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync cluster quorum
>>> service v0.1
>>> 06:59:20 corosync [MAIN ] Compatibility mode set to whitetank. Using V1
>>> and V2 of the synchronization engine.
>>> 06:59:20 corosync [TOTEM ] entering GATHER state from 15.
>>> 06:59:20 corosync [TOTEM ] Creating commit token because I am the rep.
>>> 06:59:20 corosync [TOTEM ] Saving state aru 0 high seq received 0
>>> 06:59:20 corosync [TOTEM ] Storing new sequence id for ring 1b4c
>>> 06:59:20 corosync [TOTEM ] entering COMMIT state.
>>> 06:59:20 corosync [TOTEM ] got commit token
>>> 06:59:20 corosync [TOTEM ] entering RECOVERY state.
>>> 06:59:20 corosync [TOTEM ] position [0] member 10.87.79.59:
>>> 06:59:20 corosync [TOTEM ] previous ring seq 6984 rep 10.87.79.59
>>> 06:59:20 corosync [TOTEM ] aru 0 high delivered 0 received flag 1
>>> 06:59:20 corosync [TOTEM ] Did not need to originate any messages in
>>> recovery.
>>> 06:59:20 corosync [TOTEM ] got commit token
>>> 06:59:20 corosync [TOTEM ] Sending initial ORF token
>>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0
>>> retrans queue empty 1 count 0, aru 0
>>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0
>>> retrans queue empty 1 count 1, aru 0
>>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0
>>> retrans queue empty 1 count 2, aru 0
>>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0
>>> retrans queue empty 1 count 3, aru 0
>>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>>> 06:59:20 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0
>>> aru 0 0
>>> 06:59:20 corosync [TOTEM ] Resetting old ring state
>>> 06:59:20 corosync [TOTEM ] recovery to regular 1-0
>>> 06:59:20 corosync [TOTEM ] waiting_trans_ack changed to 1
>>> 06:59:20 corosync [SYNC ] This node is within the primary component and
>>> will provide service.
>>> 06:59:20 corosync [TOTEM ] entering OPERATIONAL state.
>>> 06:59:20 corosync [TOTEM ] A processor joined or left the membership and a
>>> new membership was formed.
>>> 06:59:20 corosync [SYNC ] confchg entries 1
>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268
>>> = 1.
>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy CLM
>>> service)
>>> 06:59:20 corosync [SYNC ] confchg entries 1
>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268
>>> = 1.
>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy CLM
>>> service)
>>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy AMF
>>> service)
>>> 06:59:20 corosync [SYNC ] confchg entries 1
>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268
>>> = 1.
>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy AMF
>>> service)
>>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy CKPT
>>> service)
>>> 06:59:20 corosync [SYNC ] confchg entries 1
>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268
>>> = 1.
>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy CKPT
>>> service)
>>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy EVT
>>> service)
>>> 06:59:20 corosync [SYNC ] confchg entries 1
>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268
>>> = 1.
>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy EVT
>>> service)
>>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (corosync
>>> cluster closed process group service v1.01)
>>> 06:59:20 corosync [CPG ] comparing: sender r(0) ip(10.87.79.59) ;
>>> members(old:0 left:0)
>>> 06:59:20 corosync [CPG ] chosen downlist: sender r(0) ip(10.87.79.59) ;
>>> members(old:0 left:0)
>>> 06:59:20 corosync [SYNC ] confchg entries 1
>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268
>>> = 1.
>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>>> 06:59:20 corosync [SYNC ] Committing synchronization for (corosync cluster
>>> closed process group service v1.01)
>>> 06:59:20 corosync [MAIN ] Completed service synchronization, ready to
>>> provide service.
>>> 06:59:20 corosync [TOTEM ] waiting_trans_ack changed to 0
>>> 06:59:20 node03 lrmd: [14934]: info: G_main_add_SignalHandler: Added signal
>>> handler for signal 15
>>> 06:59:20 node03 lrmd: [14934]: info: G_main_add_SignalHandler: Added signal
>>> handler for signal 17
>>> 06:59:20 node03 lrmd: [14934]: info: enabling coredumps
>>> 06:59:20 node03 lrmd: [14934]: info: G_main_add_SignalHandler: Added signal
>>> handler for signal 10
>>> 06:59:20 node03 lrmd: [14934]: info: G_main_add_SignalHandler: Added signal
>>> handler for signal 12
>>> 06:59:20 node03 lrmd: [14934]: debug: main: run the loop...
>>> 06:59:20 node03 lrmd: [14934]: info: Started.
>>> 06:59:20 [14935]node03 attrd: info: crm_log_init_worker: Changed
>>> active directory to /var/lib/heartbeat/cores/hacluster
>>> 06:59:20 [14935]node03 attrd: info: main: Starting up
>>> 06:59:20 [14935]node03 attrd: info: get_cluster_type: Cluster
>>> type is: 'openais'
>>> 06:59:20 [14935]node03 attrd: notice: crm_cluster_connect:
>>> Connecting to cluster infrastructure: classic openais (with plugin)
>>> 06:59:20 [14936]node03 pengine: info: crm_log_init_worker: Changed
>>> active directory to /var/lib/heartbeat/cores/hacluster
>>> 06:59:20 [14935]node03 attrd: info: init_ais_connection_classic:
>>> Creating connection to our Corosync plugin
>>> 06:59:20 [14936]node03 pengine: debug: main: Checking for old
>>> instances of pengine
>>> 06:59:20 [14937]node03 crmd: info: crm_log_init_worker: Changed
>>> active directory to /var/lib/heartbeat/cores/hacluster
>>> 06:59:20 [14936]node03 pengine: debug:
>>> init_client_ipc_comms_nodispatch: Attempting to talk on:
>>> /var/run/crm/pengine
>>> 06:59:20 [14937]node03 crmd: notice: main: CRM Hg Version:
>>> 148fccfd5985c5590cc601123c6c16e966b85d14
>>> 06:59:20 [14936]node03 pengine: debug:
>>> init_client_ipc_comms_nodispatch: Could not init comms on:
>>> /var/run/crm/pengine
>>> 06:59:20 [14936]node03 pengine: debug: main: Init server comms
>>> 06:59:20 [14936]node03 pengine: info: main: Starting pengine
>>> 06:59:20 [14937]node03 crmd: debug: crmd_init: Starting crmd
>>> 06:59:20 [14937]node03 crmd: debug: s_crmd_fsa: Processing
>>> I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
>>> 06:59:20 [14937]node03 crmd: debug: do_fsa_action: actions:trace:
>>> // A_LOG
>>> 06:59:20 [14937]node03 crmd: debug: do_log: FSA: Input
>>> I_STARTUP from crmd_init() received in state S_STARTING
>>> 06:59:20 [14937]node03 crmd: debug: do_fsa_action: actions:trace:
>>> // A_STARTUP
>>> 06:59:20 [14937]node03 crmd: debug: do_startup: Registering
>>> Signal Handlers
>>> 06:59:20 [14937]node03 crmd: debug: do_startup: Creating CIB
>>> and LRM objects
>>> 06:59:20 [14937]node03 crmd: debug: do_fsa_action: actions:trace:
>>> // A_CIB_START
>>> 06:59:20 [14937]node03 crmd: debug:
>>> init_client_ipc_comms_nodispatch: Attempting to talk on:
>>> /var/run/crm/cib_rw
>>> 06:59:20 [14937]node03 crmd: debug:
>>> init_client_ipc_comms_nodispatch: Could not init comms on:
>>> /var/run/crm/cib_rw
>>> 06:59:20 [14937]node03 crmd: debug: cib_native_signon_raw:
>>> Connection to command channel failed
>>> 06:59:20 [14937]node03 crmd: debug:
>>> init_client_ipc_comms_nodispatch: Attempting to talk on:
>>> /var/run/crm/cib_callback
>>> 06:59:20 [14937]node03 crmd: debug:
>>> init_client_ipc_comms_nodispatch: Could not init comms on:
>>> /var/run/crm/cib_callback
>>> 06:59:20 [14937]node03 crmd: debug: cib_native_signon_raw:
>>> Connection to callback channel failed
>>> 06:59:20 [14937]node03 crmd: debug: cib_native_signon_raw:
>>> Connection to CIB failed: connection failed
>>> 06:59:20 [14937]node03 crmd: debug: cib_native_signoff: Signing
>>> out of the CIB Service
>>> 06:59:20 [14935]node03 attrd: debug: init_ais_connection_classic:
>>> Adding fd=6 to mainloop
>>> 06:59:20 [14935]node03 attrd: info: init_ais_connection_classic:
>>> AIS connection established
>>> 06:59:20 [14935]node03 attrd: info: get_ais_nodeid: Server
>>> details: id=1003428268 uname=node03 cname=pcmk
>>> 06:59:20 [14935]node03 attrd: info: init_ais_connection_once:
>>> Connection to 'classic openais (with plugin)': established
>>> 06:59:20 [14935]node03 attrd: debug: crm_new_peer: Creating entry
>>> for node node03/1003428268
>>> 06:59:20 [14935]node03 attrd: info: crm_new_peer: Node node03 now
>>> has id: 1003428268
>>> 06:59:20 [14935]node03 attrd: info: crm_new_peer: Node 1003428268
>>> is now known as node03
>>> 06:59:20 [14935]node03 attrd: info: main: Cluster connection
>>> active
>>> 06:59:20 [14935]node03 attrd: info: main: Accepting attribute
>>> updates
>>> 06:59:20 [14935]node03 attrd: notice: main: Starting mainloop...
>>> 06:59:20 [14933]node03 stonith-ng: info: crm_log_init_worker: Changed
>>> active directory to /var/lib/heartbeat/cores/root
>>> 06:59:20 [14933]node03 stonith-ng: info: get_cluster_type: Cluster
>>> type is: 'openais'
>>> 06:59:20 [14933]node03 stonith-ng: notice: crm_cluster_connect:
>>> Connecting to cluster infrastructure: classic openais (with plugin)
>>> 06:59:20 [14933]node03 stonith-ng: info: init_ais_connection_classic:
>>> Creating connection to our Corosync plugin
>>> 06:59:20 [14932]node03 cib: info: crm_log_init_worker: Changed
>>> active directory to /var/lib/heartbeat/cores/hacluster
>>> 06:59:20 [14932]node03 cib: info: retrieveCib: Reading cluster
>>> configuration from: /var/lib/heartbeat/crm/cib.xml (digest:
>>> /var/lib/heartbeat/crm/cib.xml.sig)
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <cib epoch="251" num_updates="0" admin_epoch="1"
>>> validate-with="pacemaker-1.2" crm_feature_set="3.0.6"
>>> update-origin="node03" update-client="crmd" cib-last-written="Tue Apr 9
>>> 06:48:33 2013" have-quorum="1" >
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <configuration >
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <crm_config >
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <cluster_property_set id="cib-bootstrap-options" >
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <nvpair
>>> id="cib-bootstrap-options-default-resource-stickiness"
>>> name="default-resource-stickiness" value="1000" />
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <nvpair id="cib-bootstrap-options-no-quorum-policy"
>>> name="no-quorum-policy" value="ignore" />
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <nvpair id="cib-bootstrap-options-stonith-enabled"
>>> name="stonith-enabled" value="false" />
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <nvpair id="cib-bootstrap-options-expected-quorum-votes"
>>> name="expected-quorum-votes" value="3" />
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <nvpair id="cib-bootstrap-options-dc-version"
>>> name="dc-version"
>>> value="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" />
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <nvpair id="cib-bootstrap-options-cluster-infrastructure"
>>> name="cluster-infrastructure" value="openais" />
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <nvpair id="cib-bootstrap-options-last-lrm-refresh"
>>> name="last-lrm-refresh" value="1365160119" />
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] </cluster_property_set>
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] </crm_config>
>>> …
>>> …
>>> ...
>>>
>>>
>>> We are still seeing the extra pacemaker daemons when corosync starts up.
>>> As an added check, all pacemaker daemons exited correctly when stopping
>>> corosync.
>>> lrmd attempts to start twice:
>>>
>>> ps aux | grep lrmd
>>> root 16412 0.0 0.0 0 0 ? Z 07:20 0:00 [lrmd]
>>> <defunct>
>>> root 16419 0.0 0.0 34240 1052 ? S 07:20 0:00
>>> /usr/lib64/heartbeat/lrmd
>>> root 21030 0.0 0.0 103244 856 pts/0 S+ 08:37 0:00 grep lrmd
>>>
>>>
>>> Help to resolve this issue appreciated..
>>>
>>> Cheers,
>>> Jimmy.
>>>
>>>
>>> On 9 Apr 2013, at 00:16, Andrew Beekhof <[email protected]> wrote:
>>>
>>>>
>>>> On 08/04/2013, at 9:44 PM, Jimmy Magee <[email protected]> wrote:
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>> thanks for your reply, we are running at debug level with the following
>>>>> config from corosync.conf
>>>>>
>>>>> logging {
>>>>> fileline: off
>>>>> to_syslog: yes
>>>>> to_stderr: no
>>>>> syslog_facility: daemon
>>>>> debug: on
>>>>> timestamp: on
>>>>> }
>>>>>
>>>>> Looking at the issue further, there seems to be 2 instances of some
>>>>> pacemaker daemons running on this particular node….
>>>>>
>>>>>
>>>>> ps aux | grep pace
>>>>>
>>>>> 495 3050 0.2 0.0 89956 7184 ? S 07:10 0:01
>>>>> /usr/libexec/pacemaker/cib
>>>>> root 3051 0.0 0.0 87128 3152 ? S 07:10 0:00
>>>>> /usr/libexec/pacemaker/stonithd
>>>>> 495 3053 0.0 0.0 91188 2840 ? S 07:10 0:00
>>>>> /usr/libexec/pacemaker/attrd
>>>>> 495 3054 0.0 0.0 87336 2484 ? S 07:10 0:00
>>>>> /usr/libexec/pacemaker/pengine
>>>>> 495 3055 0.0 0.0 91332 3156 ? S 07:10 0:00
>>>>> /usr/libexec/pacemaker/crmd
>>>>> 495 3057 0.0 0.0 88876 5224 ? S 07:10 0:00
>>>>> /usr/libexec/pacemaker/cib
>>>>> root 3058 0.0 0.0 87128 3132 ? S 07:10 0:00
>>>>> /usr/libexec/pacemaker/stonithd
>>>>> 495 3060 0.0 0.0 91188 2788 ? S 07:10 0:00
>>>>> /usr/libexec/pacemaker/attrd
>>>>> 495 3062 0.0 0.0 91436 3932 ? S 07:10 0:00
>>>>> /usr/libexec/pacemaker/crmd
>>>>>
>>>>>
>>>>> ps aux | grep corosync
>>>>> root 3044 0.1 0.0 977852 9264 ? Ssl 07:10 0:01 corosync
>>>>> root 9363 0.0 0.0 103248 856 pts/0 S+ 07:33 0:00 grep
>>>>> corosync
>>>>>
>>>>>
>>>>> ps aux | grep lrmd
>>>>> root 3052 0.0 0.0 76464 2528 ? S 07:10 0:00
>>>>> /usr/lib64/heartbeat/lrmd
>>>>>
>>>>>
>>>>> Not sure why this is the case? Appreciate any help..
>>>>>
>>>>
>>>> Have you perhaps specified "ver: 0" for the pacemaker plugin and run
>>>> "service pacemaker start" ?
>>>>
>>>>> Cheers,
>>>>> Jimmy.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 8 Apr 2013, at 03:00, Andrew Beekhof <[email protected]> wrote:
>>>>>
>>>>>> This doesn't look promising:
>>>>>>
>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for
>>>>>> signal 15
>>>>>> lrmd: [4946]: info: Signal sent to pid=4939, waiting for process to exit
>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for
>>>>>> signal 17
>>>>>> lrmd: [4939]: info: enabling coredumps
>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for
>>>>>> signal 10
>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for
>>>>>> signal 12
>>>>>> lrmd: [4939]: info: Started.
>>>>>> lrmd: [4939]: info: lrmd is shutting down
>>>>>>
>>>>>> The lrmd comes up but then immediately shuts down.
>>>>>> Perhaps try enabling debug to see if that sheds any light.
>>>>>>
>>>>>> On 06/04/2013, at 4:58 AM, Jimmy Magee <[email protected]> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> Apologies for reposting this query, it inadvertently got added to an
>>>>>>> existing topic!
>>>>>>>
>>>>>>>
>>>>>>> We have a three node cluster deployed in a customer's network:
>>>>>>> - 2 nodes are on the same switch
>>>>>>> - 3rd node on the same subnet but there's a router in between.
>>>>>>> - IP Multicast is enabled and has been tested using omping as follows..
>>>>>>>
>>>>>>> On each node ran..
>>>>>>>
>>>>>>> omping node01 node02 node03
>>>>>>>
>>>>>>>
>>>>>>> ON node 3
>>>>>>>
>>>>>>> Node01 : unicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev =
>>>>>>> 0.128/0.181/0.255/0.025
>>>>>>> Node01 : multicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev =
>>>>>>> 0.140/0.187/0.219/0.021
>>>>>>> Node02 : unicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev =
>>>>>>> 0.115/0.150/0.168/0.021
>>>>>>> Node02 : multicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev =
>>>>>>> 0.134/0.162/0.177/0.014
>>>>>>>
>>>>>>>
>>>>>>> On node 2
>>>>>>>
>>>>>>>
>>>>>>> Node01 : unicast, xmt/rcv/%loss = 9/9/0%, min/avg/max/std-dev =
>>>>>>> 0.168/0.191/0.205/0.014
>>>>>>> Node01 : multicast, xmt/rcv/%loss = 9/8/11% (seq>=2 0%),
>>>>>>> min/avg/max/std-dev = 0.138/0.179/0.206/0.028
>>>>>>> Node03 : unicast, xmt/rcv/%loss = 9/9/0%, min/avg/max/std-dev =
>>>>>>> 0.112/0.149/0.175/0.022
>>>>>>> Node03 : multicast, xmt/rcv/%loss = 9/8/11% (seq>=2 0%),
>>>>>>> min/avg/max/std-dev = 0.124/0.167/0.178/0.018
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On node 1
>>>>>>>
>>>>>>> Node02 : unicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev =
>>>>>>> 0.154/0.185/0.208/0.019
>>>>>>> Node02 : multicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev =
>>>>>>> 0.175/0.198/0.214/0.015
>>>>>>> Node03 : unicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev =
>>>>>>> 0.114/0.160/0.185/0.019
>>>>>>> Node03 : multicast, xmt/rcv/%loss = 23/22/4% (seq>=2 0%),
>>>>>>> min/avg/max/std-dev = 0.124/0.172/0.197/0.019
>>>>>>>
>>>>>>>
>>>>>>> - Problem is intermittent but frequent. Occasionally starts fine when
>>>>>>> started from scratch.
>>>>>>>
>>>>>>> We suspect the problem is related to node 3 as we can see lrmd failures
>>>>>>> as per the attached log. We've checked permissions are ok as per
>>>>>>> https://bugs.launchpad.net/ubuntu/+source/cluster-glue/+bug/676391
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> stonith-ng[1437]: error: ais_dispatch: AIS connection failed
>>>>>>> stonith-ng[1437]: error: stonith_peer_ais_destroy: AIS connection
>>>>>>> terminated
>>>>>>> corosync[1430]: [SERV ] Service engine unloaded: Pacemaker Cluster
>>>>>>> Manager 1.1.6
>>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync extended
>>>>>>> virtual synchrony service
>>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync
>>>>>>> configuration service
>>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync cluster
>>>>>>> closed process group service v1.01
>>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync cluster
>>>>>>> config database access v1.01
>>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync profile
>>>>>>> loading service
>>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync cluster
>>>>>>> quorum service v0.1
>>>>>>> corosync[1430]: [MAIN ] Corosync Cluster Engine exiting with status
>>>>>>> 0 at main.c:1894.
>>>>>>>
>>>>>>> corosync[4931]: [MAIN ] Corosync built-in features: nss dbus rdma
>>>>>>> snmp
>>>>>>> corosync[4931]: [MAIN ] Successfully read main configuration file
>>>>>>> '/etc/corosync/corosync.conf'.
>>>>>>> corosync[4931]: [TOTEM ] Initializing transport (UDP/IP Multicast).
>>>>>>> corosync[4931]: [TOTEM ] Initializing transmit/receive security:
>>>>>>> libtomcrypt SOBER128/SHA1HMAC (mode 0).
>>>>>>> corosync[4931]: [TOTEM ] The network interface [10.87.79.59] is now
>>>>>>> up.
>>>>>>> corosync[4931]: [pcmk ] Logging: Initialized pcmk_startup
>>>>>>> corosync[4931]: [SERV ] Service engine loaded: Pacemaker Cluster
>>>>>>> Manager 1.1.6
>>>>>>> corosync[4931]: [pcmk ] Logging: Initialized pcmk_startup
>>>>>>> corosync[4931]: [SERV ] Service engine loaded: Pacemaker Cluster
>>>>>>> Manager 1.1.6
>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync extended
>>>>>>> virtual synchrony service
>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync
>>>>>>> configuration service
>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync cluster
>>>>>>> closed process group service v1.01
>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync cluster
>>>>>>> config database access v1.01
>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync profile
>>>>>>> loading service
>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync cluster
>>>>>>> quorum service v0.1
>>>>>>> corosync[4931]: [MAIN ] Compatibility mode set to whitetank. Using
>>>>>>> V1 and V2 of the synchronization engine.
>>>>>>> corosync[4931]: [TOTEM ] A processor joined or left the membership
>>>>>>> and a new membership was formed.
>>>>>>> corosync[4931]: [CPG ] chosen downlist: sender r(0) ip(10.87.79.59)
>>>>>>> ; members(old:0 left:0)
>>>>>>> corosync[4931]: [MAIN ] Completed service synchronization, ready to
>>>>>>> provide service.
>>>>>>> cib[4937]: info: crm_log_init_worker: Changed active directory to
>>>>>>> /var/lib/heartbeat/cores/hacluster
>>>>>>> cib[4937]: info: retrieveCib: Reading cluster configuration from:
>>>>>>> /var/lib/heartbeat/crm/cib.xml (digest:
>>>>>>> /var/lib/heartbeat/crm/cib.xml.sig)
>>>>>>> cib[4937]: info: validate_with_relaxng: Creating RNG parser context
>>>>>>> stonith-ng[4945]: info: crm_log_init_worker: Changed active
>>>>>>> directory to /var/lib/heartbeat/cores/root
>>>>>>> stonith-ng[4945]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>>> stonith-ng[4945]: notice: crm_cluster_connect: Connecting to cluster
>>>>>>> infrastructure: classic openais (with plugin)
>>>>>>> stonith-ng[4945]: info: init_ais_connection_classic: Creating
>>>>>>> connection to our Corosync plugin
>>>>>>> cib[4944]: info: crm_log_init_worker: Changed active directory to
>>>>>>> /var/lib/heartbeat/cores/hacluster
>>>>>>> cib[4944]: info: retrieveCib: Reading cluster configuration from:
>>>>>>> /var/lib/heartbeat/crm/cib.xml (digest:
>>>>>>> /var/lib/heartbeat/crm/cib.xml.sig)
>>>>>>> stonith-ng[4945]: info: init_ais_connection_classic: AIS connection
>>>>>>> established
>>>>>>> stonith-ng[4945]: info: get_ais_nodeid: Server details:
>>>>>>> id=1003428268 uname=node03 cname=pcmk
>>>>>>> stonith-ng[4945]: info: init_ais_connection_once: Connection to
>>>>>>> 'classic openais (with plugin)': established
>>>>>>> stonith-ng[4945]: info: crm_new_peer: Node node03 now has id:
>>>>>>> 1003428268
>>>>>>> stonith-ng[4945]: info: crm_new_peer: Node 1003428268 is now known
>>>>>>> as node03
>>>>>>> cib[4944]: info: validate_with_relaxng: Creating RNG parser context
>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for
>>>>>>> signal 15
>>>>>>> lrmd: [4946]: info: Signal sent to pid=4939, waiting for process to exit
>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for
>>>>>>> signal 17
>>>>>>> lrmd: [4939]: info: enabling coredumps
>>>>>>> stonith-ng[4938]: info: crm_log_init_worker: Changed active
>>>>>>> directory to /var/lib/heartbeat/cores/root
>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for
>>>>>>> signal 10
>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for
>>>>>>> signal 12
>>>>>>> lrmd: [4939]: info: Started.
>>>>>>> stonith-ng[4938]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>>> lrmd: [4939]: info: lrmd is shutting down
>>>>>>> stonith-ng[4938]: notice: crm_cluster_connect: Connecting to cluster
>>>>>>> infrastructure: classic openais (with plugin)
>>>>>>> stonith-ng[4938]: info: init_ais_connection_classic: Creating
>>>>>>> connection to our Corosync plugin
>>>>>>> attrd[4940]: info: crm_log_init_worker: Changed active directory to
>>>>>>> /var/lib/heartbeat/cores/hacluster
>>>>>>> pengine[4941]: info: crm_log_init_worker: Changed active directory
>>>>>>> to /var/lib/heartbeat/cores/hacluster
>>>>>>> attrd[4940]: info: main: Starting up
>>>>>>> attrd[4940]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>>> attrd[4940]: notice: crm_cluster_connect: Connecting to cluster
>>>>>>> infrastructure: classic openais (with plugin)
>>>>>>> attrd[4940]: info: init_ais_connection_classic: Creating connection
>>>>>>> to our Corosync plugin
>>>>>>> crmd[4942]: info: crm_log_init_worker: Changed active directory to
>>>>>>> /var/lib/heartbeat/cores/hacluster
>>>>>>> pengine[4941]: info: main: Starting pengine
>>>>>>> crmd[4942]: notice: main: CRM Hg Version:
>>>>>>> 148fccfd5985c5590cc601123c6c16e966b85d14
>>>>>>> pengine[4948]: info: crm_log_init_worker: Changed active directory
>>>>>>> to /var/lib/heartbeat/cores/hacluster
>>>>>>> pengine[4948]: warning: main: Terminating previous PE instance
>>>>>>> attrd[4947]: info: crm_log_init_worker: Changed active directory to
>>>>>>> /var/lib/heartbeat/cores/hacluster
>>>>>>> pengine[4941]: warning: process_pe_message: Received quit message,
>>>>>>> terminating
>>>>>>> attrd[4947]: info: main: Starting up
>>>>>>> attrd[4947]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>>> attrd[4947]: notice: crm_cluster_connect: Connecting to cluster
>>>>>>> infrastructure: classic openais (with plugin)
>>>>>>> attrd[4947]: info: init_ais_connection_classic: Creating connection
>>>>>>> to our Corosync plugin
>>>>>>> crmd[4949]: info: crm_log_init_worker: Changed active directory to
>>>>>>> /var/lib/heartbeat/cores/hacluster
>>>>>>> crmd[4949]: notice: main: CRM Hg Version:
>>>>>>> 148fccfd5985c5590cc601123c6c16e966b85d14
>>>>>>> stonith-ng[4938]: info: init_ais_connection_classic: AIS connection
>>>>>>> established
>>>>>>> stonith-ng[4938]: info: get_ais_nodeid: Server details:
>>>>>>> id=1003428268 uname=node03 cname=pcmk
>>>>>>> stonith-ng[4938]: info: init_ais_connection_once: Connection to
>>>>>>> 'classic openais (with plugin)': established
>>>>>>> stonith-ng[4938]: info: crm_new_peer: Node node03 now has id:
>>>>>>> 1003428268
>>>>>>> stonith-ng[4938]: info: crm_new_peer: Node 1003428268 is now known
>>>>>>> as node03
>>>>>>> attrd[4940]: info: init_ais_connection_classic: AIS connection
>>>>>>> established
>>>>>>> attrd[4940]: info: get_ais_nodeid: Server details: id=1003428268
>>>>>>> uname=node03 cname=pcmk
>>>>>>> attrd[4940]: info: init_ais_connection_once: Connection to 'classic
>>>>>>> openais (with plugin)': established
>>>>>>> attrd[4940]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>>> attrd[4940]: info: crm_new_peer: Node 1003428268 is now known as
>>>>>>> node03
>>>>>>> attrd[4940]: info: main: Cluster connection active
>>>>>>> attrd[4940]: info: main: Accepting attribute updates
>>>>>>> attrd[4940]: notice: main: Starting mainloop...
>>>>>>> attrd[4947]: info: init_ais_connection_classic: AIS connection
>>>>>>> established
>>>>>>> attrd[4947]: info: get_ais_nodeid: Server details: id=1003428268
>>>>>>> uname=node03 cname=pcmk
>>>>>>> attrd[4947]: info: init_ais_connection_once: Connection to 'classic
>>>>>>> openais (with plugin)': established
>>>>>>> attrd[4947]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>>> attrd[4947]: info: crm_new_peer: Node 1003428268 is now known as
>>>>>>> node03
>>>>>>> attrd[4947]: info: main: Cluster connection active
>>>>>>> attrd[4947]: info: main: Accepting attribute updates
>>>>>>> attrd[4947]: notice: main: Starting mainloop...
>>>>>>> cib[4937]: info: startCib: CIB Initialization completed successfully
>>>>>>> cib[4937]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>>> cib[4937]: notice: crm_cluster_connect: Connecting to cluster
>>>>>>> infrastructure: classic openais (with plugin)
>>>>>>> cib[4937]: info: init_ais_connection_classic: Creating connection
>>>>>>> to our Corosync plugin
>>>>>>> cib[4944]: info: startCib: CIB Initialization completed successfully
>>>>>>> cib[4944]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>>> cib[4944]: notice: crm_cluster_connect: Connecting to cluster
>>>>>>> infrastructure: classic openais (with plugin)
>>>>>>> cib[4944]: info: init_ais_connection_classic: Creating connection
>>>>>>> to our Corosync plugin
>>>>>>> cib[4937]: info: init_ais_connection_classic: AIS connection
>>>>>>> established
>>>>>>> cib[4937]: info: get_ais_nodeid: Server details: id=1003428268
>>>>>>> uname=node03 cname=pcmk
>>>>>>> cib[4937]: info: init_ais_connection_once: Connection to 'classic
>>>>>>> openais (with plugin)': established
>>>>>>> cib[4937]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>>> cib[4937]: info: crm_new_peer: Node 1003428268 is now known as
>>>>>>> node03
>>>>>>> cib[4937]: info: cib_init: Starting cib mainloop
>>>>>>> cib[4937]: info: ais_dispatch_message: Membership 6892: quorum
>>>>>>> still lost
>>>>>>> cib[4937]: info: crm_update_peer: Node node03: id=1003428268
>>>>>>> state=member (new) addr=r(0) ip(10.87.79.59) (new) votes=1 (new)
>>>>>>> born=0 seen=6892 proc=00000000000000000000000000111312 (new)
>>>>>>> cib[4944]: info: init_ais_connection_classic: AIS connection
>>>>>>> established
>>>>>>> cib[4944]: info: get_ais_nodeid: Server details: id=1003428268
>>>>>>> uname=node03 cname=pcmk
>>>>>>> cib[4944]: info: init_ais_connection_once: Connection to 'classic
>>>>>>> openais (with plugin)': established
>>>>>>> cib[4944]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>>> cib[4944]: info: crm_new_peer: Node 1003428268 is now known as
>>>>>>> node03
>>>>>>> cib[4944]: info: cib_init: Starting cib mainloop
>>>>>>> stonith-ng[4945]: notice: setup_cib: Watching for stonith topology
>>>>>>> changes
>>>>>>> stonith-ng[4945]: info: main: Starting stonith-ng mainloop
>>>>>>> cib[4937]: info: ais_dispatch_message: Membership 6896: quorum
>>>>>>> still lost
>>>>>>> corosync[4931]: [TOTEM ] A processor joined or left the membership
>>>>>>> and a new membership was formed.
>>>>>>> cib[4937]: info: crm_new_peer: Node <null> now has id: 969873836
>>>>>>> cib[4937]: info: crm_update_peer: Node (null): id=969873836
>>>>>>> state=member (new) addr=r(0) ip(172.25.207.57) votes=0 born=0
>>>>>>> seen=6896 proc=00000000000000000000000000000000
>>>>>>> cib[4937]: info: crm_new_peer: Node <null> now has id: 986651052
>>>>>>> cib[4937]: info: crm_update_peer: Node (null): id=986651052
>>>>>>> state=member (new) addr=r(0) ip(172.25.207.58) votes=0 born=0
>>>>>>> seen=6896 proc=00000000000000000000000000000000
>>>>>>> cib[4937]: notice: ais_dispatch_message: Membership 6896: quorum
>>>>>>> acquired
>>>>>>> cib[4937]: info: crm_get_peer: Node 986651052 is now known as node02
>>>>>>> cib[4937]: info: crm_update_peer: Node node02: id=986651052
>>>>>>> state=member addr=r(0) ip(172.25.207.58) votes=1 (new) born=6812
>>>>>>> seen=6896 proc=00000000000000000000000000111312 (new)
>>>>>>> cib[4937]: info: ais_dispatch_message: Membership 6896: quorum
>>>>>>> retained
>>>>>>> cib[4937]: info: crm_get_peer: Node 969873836 is now known as node01
>>>>>>> cib[4937]: info: crm_update_peer: Node node01: id=969873836
>>>>>>> state=member addr=r(0) ip(172.25.207.57) votes=1 (new) born=6848
>>>>>>> seen=6896 proc=00000000000000000000000000111312 (new)
>>>>>>> rsyslogd-2177: imuxsock begins to drop messages from pid 4931 due to
>>>>>>> rate-limiting
>>>>>>> crmd[4942]: info: do_cib_control: CIB connection established
>>>>>>> crmd[4942]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>>> crmd[4942]: notice: crm_cluster_connect: Connecting to cluster
>>>>>>> infrastructure: classic openais (with plugin)
>>>>>>> crmd[4942]: info: init_ais_connection_classic: Creating connection
>>>>>>> to our Corosync plugin
>>>>>>> cib[4937]: info: cib_process_diff: Diff 1.249.28 -> 1.249.29 not
>>>>>>> applied to 1.249.0: current "num_updates" is less than required
>>>>>>> cib[4937]: info: cib_server_process_diff: Requesting re-sync from
>>>>>>> peer
>>>>>>> crmd[4949]: info: do_cib_control: CIB connection established
>>>>>>> crmd[4949]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>>> crmd[4949]: notice: crm_cluster_connect: Connecting to cluster
>>>>>>> infrastructure: classic openais (with plugin)
>>>>>>> crmd[4949]: info: init_ais_connection_classic: Creating connection
>>>>>>> to our Corosync plugin
>>>>>>> stonith-ng[4938]: notice: setup_cib: Watching for stonith topology
>>>>>>> changes
>>>>>>> stonith-ng[4938]: info: main: Starting stonith-ng mainloop
>>>>>>> cib[4937]: notice: cib_server_process_diff: Not applying diff
>>>>>>> 1.249.29 -> 1.249.30 (sync in progress)
>>>>>>> crmd[4942]: info: init_ais_connection_classic: AIS connection
>>>>>>> established
>>>>>>> crmd[4942]: info: get_ais_nodeid: Server details: id=1003428268
>>>>>>> uname=node03 cname=pcmk
>>>>>>> crmd[4942]: info: init_ais_connection_once: Connection to 'classic
>>>>>>> openais (with plugin)': established
>>>>>>> crmd[4942]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>>> crmd[4942]: info: crm_new_peer: Node 1003428268 is now known as
>>>>>>> node03
>>>>>>> crmd[4942]: info: ais_status_callback: status: node03 is now unknown
>>>>>>> crmd[4942]: info: do_ha_control: Connected to the cluster
>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 1
>>>>>>> (30 max) times
>>>>>>> crmd[4949]: info: init_ais_connection_classic: AIS connection
>>>>>>> established
>>>>>>> crmd[4949]: info: get_ais_nodeid: Server details: id=1003428268
>>>>>>> uname=node03 cname=pcmk
>>>>>>> crmd[4949]: info: init_ais_connection_once: Connection to 'classic
>>>>>>> openais (with plugin)': established
>>>>>>> crmd[4942]: notice: ais_dispatch_message: Membership 6896: quorum
>>>>>>> acquired
>>>>>>> crmd[4949]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>>> crmd[4949]: info: crm_new_peer: Node 1003428268 is now known as
>>>>>>> node03
>>>>>>> crmd[4942]: info: crm_new_peer: Node node01 now has id: 969873836
>>>>>>> crmd[4949]: info: ais_status_callback: status: node03 is now unknown
>>>>>>> crmd[4942]: info: crm_new_peer: Node 969873836 is now known as
>>>>>>> node01
>>>>>>> crmd[4949]: info: do_ha_control: Connected to the cluster
>>>>>>> crmd[4942]: info: ais_status_callback: status: node01 is now unknown
>>>>>>> crmd[4942]: info: ais_status_callback: status: node01 is now member
>>>>>>> (was unknown)
>>>>>>> crmd[4942]: info: crm_update_peer: Node node01: id=969873836
>>>>>>> state=member (new) addr=r(0) ip(172.25.207.57) votes=1 born=6848
>>>>>>> seen=6896 proc=00000000000000000000000000111312
>>>>>>> crmd[4942]: info: crm_new_peer: Node node02 now has id: 986651052
>>>>>>> crmd[4942]: info: crm_new_peer: Node 986651052 is now known as
>>>>>>> node02
>>>>>>> crmd[4942]: info: ais_status_callback: status: node02 is now unknown
>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 1
>>>>>>> (30 max) times
>>>>>>> crmd[4942]: info: ais_status_callback: status: node02 is now member
>>>>>>> (was unknown)
>>>>>>> crmd[4942]: info: crm_update_peer: Node node02: id=986651052
>>>>>>> state=member (new) addr=r(0) ip(172.25.207.58) votes=1 born=6812
>>>>>>> seen=6896 proc=00000000000000000000000000111312
>>>>>>> crmd[4942]: notice: crmd_peer_update: Status update: Client
>>>>>>> node03/crmd now has status [online] (DC=<null>)
>>>>>>> crmd[4942]: info: ais_status_callback: status: node03 is now member
>>>>>>> (was unknown)
>>>>>>> crmd[4942]: info: crm_update_peer: Node node03: id=1003428268
>>>>>>> state=member (new) addr=r(0) ip(10.87.79.59) (new) votes=1 (new)
>>>>>>> born=6896 seen=6896 proc=00000000000000000000000000111312 (new)
>>>>>>> crmd[4942]: info: ais_dispatch_message: Membership 6896: quorum
>>>>>>> retained
>>>>>>> cib[4937]: notice: cib_server_process_diff: Not applying diff
>>>>>>> 1.249.30 -> 1.249.31 (sync in progress)
>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 2
>>>>>>> (30 max) times
>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 3
>>>>>>> (30 max) times
>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 2
>>>>>>> (30 max) times
>>>>>>> crmd[4949]: notice: ais_dispatch_message: Membership 6896: quorum
>>>>>>> acquired
>>>>>>> rsyslogd-2177: imuxsock begins to drop messages from pid 4937 due to
>>>>>>> rate-limiting
>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 4
>>>>>>> (30 max) times
>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 5
>>>>>>> (30 max) times
>>>>>>> pengine[4948]: info: main: Starting pengine
>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped
>>>>>>> (2000ms)
>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 6
>>>>>>> (30 max) times
>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped
>>>>>>> (2000ms)
>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 3
>>>>>>> (30 max) times
>>>>>>> attrd[4940]: info: cib_connect: Connected to the CIB after 1 signon
>>>>>>> attempts
>>>>>>> attrd[4940]: info: cib_connect: Sending full refresh
>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped
>>>>>>> (2000ms)
>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 7
>>>>>>> (30 max) times
>>>>>>> attrd[4947]: info: cib_connect: Connected to the CIB after 1 signon
>>>>>>> attempts
>>>>>>> attrd[4947]: info: cib_connect: Sending full refresh
>>>>>>> [...the same "crm_timer_popped: Wait Timer (I_NULL) just popped
>>>>>>> (2000ms)" and "do_lrm_control: Failed to sign on to the LRM N
>>>>>>> (30 max) times" messages repeat every 2 seconds for both crmd
>>>>>>> instances; crmd[4942] reaches attempt 18 and crmd[4949] attempt 15
>>>>>>> before the excerpt ends...]
>>>>>>>
>>>>>>>
>>>>>>> We have the following components installed:
>>>>>>>
>>>>>>>
>>>>>>> corosynclib-1.4.1-15.el6.x86_64
>>>>>>> corosync-1.4.1-15.el6.x86_64
>>>>>>> cluster-glue-libs-1.0.5-6.el6.x86_64
>>>>>>> clusterlib-3.0.12.1-49.el6.x86_64
>>>>>>> pacemaker-cluster-libs-1.1.7-6.el6.x86_64
>>>>>>> cluster-glue-1.0.5-6.el6.x86_64
>>>>>>> resource-agents-3.9.2-12.el6.x86_64
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> We'd appreciate assistance with debugging this issue, and any
>>>>>>> pointers to possible causes.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Jimmy
>>>>>>> _______________________________________________
>>>>>>> Linux-HA mailing list
>>>>>>> [email protected]
>>>>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>>>>>> See also: http://linux-ha.org/ReportingProblems
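Incidentally, a component inventory like the one quoted above is easiest to capture for a report with `rpm -qa` piped through a filter. The sketch below runs that filter over a sample dump (taken from the quoted list, plus one unrelated package to show the filtering); on a real node you would feed it `rpm -qa` instead of the here-doc:

```shell
# On a cluster node:  rpm -qa | grep -E 'corosync|pacemaker|cluster|resource-agents' | sort
# Here the same filter runs over a sample package dump so the pipeline
# itself can be checked without rpm installed.
cat <<'EOF' | grep -E 'corosync|pacemaker|cluster|resource-agents' | sort
corosynclib-1.4.1-15.el6.x86_64
corosync-1.4.1-15.el6.x86_64
cluster-glue-libs-1.0.5-6.el6.x86_64
clusterlib-3.0.12.1-49.el6.x86_64
pacemaker-cluster-libs-1.1.7-6.el6.x86_64
cluster-glue-1.0.5-6.el6.x86_64
resource-agents-3.9.2-12.el6.x86_64
glibc-2.12-1.107.el6.x86_64
EOF
```

The unrelated `glibc` line is dropped by the filter, leaving only the HA stack, sorted, ready to paste into a problem report.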