On 11/04/2013, at 6:05 AM, Jimmy Magee <[email protected]> wrote:
> Hi,
>
> Following up on the above thread, any thoughts as to what may be causing
> the issue..

One of the main reasons pacemakerd was created was to avoid weirdness around
the starting of pacemaker's child processes from within a multi-threaded
application like corosync... which is almost certainly what you're bumping
into here.

Could you try using "ver: 1" in corosync.conf and "service pacemaker start"
to rule out any other causes?
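For example, something like this in corosync.conf (the same stanza you
quoted below, with "ver" bumped so the plugin no longer forks the daemons
itself -- just a sketch, double-check against your installed init scripts):

   service {
        # Load the Pacemaker plugin, but leave starting/stopping the
        # pacemaker daemons to the pacemaker init script
        name: pacemaker
        ver:  1
   }

and then bring the stack up in two steps:

   service corosync start
   service pacemaker start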
> Cheers,
> Jimmy.
>
> On 9 Apr 2013, at 13:39, Jimmy Magee <[email protected]> wrote:
>
>> Hi Andrew,
>>
>> The corosync.conf is configured as follows:
>>
>>> service {
>>>     # Load the Pacemaker Cluster Resource Manager
>>>     name: pacemaker
>>>     ver:  0
>>> }
>>
>> and pacemaker is not started via service pacemaker start…
>>
>> Here is an extract from the logs, with extra debug, from attempting to
>> start corosync/pacemaker..
>>
>> 06:59:20 corosync [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
>> 06:59:20 corosync [MAIN ] Corosync built-in features: nss dbus rdma snmp
>> 06:59:20 corosync [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
>> 06:59:20 corosync [TOTEM ] waiting_trans_ack changed to 1
>> 06:59:20 corosync [TOTEM ] Token Timeout (5000 ms) retransmit timeout (247 ms)
>> 06:59:20 corosync [TOTEM ] token hold (187 ms) retransmits before loss (20 retrans)
>> 06:59:20 corosync [TOTEM ] join (1000 ms) send_join (0 ms) consensus (7500 ms) merge (200 ms)
>> 06:59:20 corosync [TOTEM ] downcheck (1000 ms) fail to recv const (2500 msgs)
>> 06:59:20 corosync [TOTEM ] seqno unchanged const (30 rotations) Maximum network MTU 1402
>> 06:59:20 corosync [TOTEM ] window size per rotation (50 messages) maximum messages per rotation (20 messages)
>> 06:59:20 corosync [TOTEM ] missed count const (5 messages)
>> 06:59:20 corosync [TOTEM ] send threads (0 threads)
>> 06:59:20 corosync [TOTEM ] RRP token expired timeout (247 ms)
>> 06:59:20 corosync [TOTEM ] RRP token problem counter (2000 ms)
>> 06:59:20 corosync [TOTEM ] RRP threshold (10 problem count)
>> 06:59:20 corosync [TOTEM ] RRP multicast threshold (100 problem count)
>> 06:59:20 corosync [TOTEM ] RRP automatic recovery check timeout (1000 ms)
>> 06:59:20 corosync [TOTEM ] RRP mode set to none.
>> 06:59:20 corosync [TOTEM ] heartbeat_failures_allowed (0)
>> 06:59:20 corosync [TOTEM ] max_network_delay (50 ms)
>> 06:59:20 corosync [TOTEM ] HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0
>> 06:59:20 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
>> 06:59:20 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
>> 06:59:20 corosync [IPC ] you are using ipc api v2
>> 06:59:20 corosync [TOTEM ] Receive multicast socket recv buffer size (320000 bytes).
>> 06:59:20 corosync [TOTEM ] Transmit multicast socket send buffer size (320000 bytes).
>> 06:59:20 corosync [TOTEM ] Local receive multicast loop socket recv buffer size (320000 bytes).
>> 06:59:20 corosync [TOTEM ] Local transmit multicast loop socket send buffer size (320000 bytes).
>> 06:59:20 corosync [TOTEM ] The network interface [10.87.79.59] is now up.
>> 06:59:20 corosync [TOTEM ] Created or loaded sequence id 6984.10.87.79.59 for this ring.
>> Set r/w permissions for uid=0, gid=0 on /var/log/corosync.log
>> 06:59:20 corosync [pcmk ] Logging: Initialized pcmk_startup
>> Set r/w permissions for uid=0, gid=0 on /var/log/corosync.log
>> 06:59:20 corosync [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.1.6
>> 06:59:20 corosync [pcmk ] Logging: Initialized pcmk_startup
>> 06:59:20 corosync [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.1.6
>> 06:59:20 corosync [SERV ] Service engine loaded: corosync extended virtual synchrony service
>> 06:59:20 corosync [SERV ] Service engine loaded: corosync configuration service
>> 06:59:20 corosync [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
>> 06:59:20 corosync [SERV ] Service engine loaded: corosync cluster config database access v1.01
>> 06:59:20 corosync [SERV ] Service engine loaded: corosync profile loading service
>> 06:59:20 corosync [SERV ] Service engine loaded: corosync cluster quorum service v0.1
>> 06:59:20 corosync [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
>> 06:59:20 corosync [TOTEM ] entering GATHER state from 15.
>> 06:59:20 corosync [TOTEM ] Creating commit token because I am the rep.
>> 06:59:20 corosync [TOTEM ] Saving state aru 0 high seq received 0
>> 06:59:20 corosync [TOTEM ] Storing new sequence id for ring 1b4c
>> 06:59:20 corosync [TOTEM ] entering COMMIT state.
>> 06:59:20 corosync [TOTEM ] got commit token
>> 06:59:20 corosync [TOTEM ] entering RECOVERY state.
>> 06:59:20 corosync [TOTEM ] position [0] member 10.87.79.59:
>> 06:59:20 corosync [TOTEM ] previous ring seq 6984 rep 10.87.79.59
>> 06:59:20 corosync [TOTEM ] aru 0 high delivered 0 received flag 1
>> 06:59:20 corosync [TOTEM ] Did not need to originate any messages in recovery.
>> 06:59:20 corosync [TOTEM ] got commit token
>> 06:59:20 corosync [TOTEM ] Sending initial ORF token
>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0
>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> 06:59:20 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
>> 06:59:20 corosync [TOTEM ] Resetting old ring state
>> 06:59:20 corosync [TOTEM ] recovery to regular 1-0
>> 06:59:20 corosync [TOTEM ] waiting_trans_ack changed to 1
>> 06:59:20 corosync [SYNC ] This node is within the primary component and will provide service.
>> 06:59:20 corosync [TOTEM ] entering OPERATIONAL state.
>> 06:59:20 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> 06:59:20 corosync [SYNC ] confchg entries 1
>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 = 1.
>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy CLM service)
>> 06:59:20 corosync [SYNC ] confchg entries 1
>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 = 1.
>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy CLM service)
>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy AMF service)
>> 06:59:20 corosync [SYNC ] confchg entries 1
>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 = 1.
>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy AMF service)
>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy CKPT service)
>> 06:59:20 corosync [SYNC ] confchg entries 1
>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 = 1.
>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy CKPT service)
>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy EVT service)
>> 06:59:20 corosync [SYNC ] confchg entries 1
>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 = 1.
>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy EVT service)
>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (corosync cluster closed process group service v1.01)
>> 06:59:20 corosync [CPG ] comparing: sender r(0) ip(10.87.79.59) ; members(old:0 left:0)
>> 06:59:20 corosync [CPG ] chosen downlist: sender r(0) ip(10.87.79.59) ; members(old:0 left:0)
>> 06:59:20 corosync [SYNC ] confchg entries 1
>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268 = 1.
>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>> 06:59:20 corosync [SYNC ] Committing synchronization for (corosync cluster closed process group service v1.01)
>> 06:59:20 corosync [MAIN ] Completed service synchronization, ready to provide service.
>> 06:59:20 corosync [TOTEM ] waiting_trans_ack changed to 0
>> 06:59:20 node03 lrmd: [14934]: info: G_main_add_SignalHandler: Added signal handler for signal 15
>> 06:59:20 node03 lrmd: [14934]: info: G_main_add_SignalHandler: Added signal handler for signal 17
>> 06:59:20 node03 lrmd: [14934]: info: enabling coredumps
>> 06:59:20 node03 lrmd: [14934]: info: G_main_add_SignalHandler: Added signal handler for signal 10
>> 06:59:20 node03 lrmd: [14934]: info: G_main_add_SignalHandler: Added signal handler for signal 12
>> 06:59:20 node03 lrmd: [14934]: debug: main: run the loop...
>> 06:59:20 node03 lrmd: [14934]: info: Started.
>> 06:59:20 [14935] node03 attrd: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>> 06:59:20 [14935] node03 attrd: info: main: Starting up
>> 06:59:20 [14935] node03 attrd: info: get_cluster_type: Cluster type is: 'openais'
>> 06:59:20 [14935] node03 attrd: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
>> 06:59:20 [14936] node03 pengine: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>> 06:59:20 [14935] node03 attrd: info: init_ais_connection_classic: Creating connection to our Corosync plugin
>> 06:59:20 [14936] node03 pengine: debug: main: Checking for old instances of pengine
>> 06:59:20 [14937] node03 crmd: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>> 06:59:20 [14936] node03 pengine: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine
>> 06:59:20 [14937] node03 crmd: notice: main: CRM Hg Version: 148fccfd5985c5590cc601123c6c16e966b85d14
>> 06:59:20 [14936] node03 pengine: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/pengine
>> 06:59:20 [14936] node03 pengine: debug: main: Init server comms
>> 06:59:20 [14936] node03 pengine: info: main: Starting pengine
>> 06:59:20 [14937] node03 crmd: debug: crmd_init: Starting crmd
>> 06:59:20 [14937] node03 crmd: debug: s_crmd_fsa: Processing I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
>> 06:59:20 [14937] node03 crmd: debug: do_fsa_action: actions:trace: // A_LOG
>> 06:59:20 [14937] node03 crmd: debug: do_log: FSA: Input I_STARTUP from crmd_init() received in state S_STARTING
>> 06:59:20 [14937] node03 crmd: debug: do_fsa_action: actions:trace: // A_STARTUP
>> 06:59:20 [14937] node03 crmd: debug: do_startup: Registering Signal Handlers
>> 06:59:20 [14937] node03 crmd: debug: do_startup: Creating CIB and LRM objects
>> 06:59:20 [14937] node03 crmd: debug: do_fsa_action: actions:trace: // A_CIB_START
>> 06:59:20 [14937] node03 crmd: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_rw
>> 06:59:20 [14937] node03 crmd: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_rw
>> 06:59:20 [14937] node03 crmd: debug: cib_native_signon_raw: Connection to command channel failed
>> 06:59:20 [14937] node03 crmd: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_callback
>> 06:59:20 [14937] node03 crmd: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_callback
>> 06:59:20 [14937] node03 crmd: debug: cib_native_signon_raw: Connection to callback channel failed
>> 06:59:20 [14937] node03 crmd: debug: cib_native_signon_raw: Connection to CIB failed: connection failed
>> 06:59:20 [14937] node03 crmd: debug: cib_native_signoff: Signing out of the CIB Service
>> 06:59:20 [14935] node03 attrd: debug: init_ais_connection_classic: Adding fd=6 to mainloop
>> 06:59:20 [14935] node03 attrd: info: init_ais_connection_classic: AIS connection established
>> 06:59:20 [14935] node03 attrd: info: get_ais_nodeid: Server details: id=1003428268 uname=node03 cname=pcmk
>> 06:59:20 [14935] node03 attrd: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
>> 06:59:20 [14935] node03 attrd: debug: crm_new_peer: Creating entry for node node03/1003428268
>> 06:59:20 [14935] node03 attrd: info: crm_new_peer: Node node03 now has id: 1003428268
>> 06:59:20 [14935] node03 attrd: info: crm_new_peer: Node 1003428268 is now known as node03
>> 06:59:20 [14935] node03 attrd: info: main: Cluster connection active
>> 06:59:20 [14935] node03 attrd: info: main: Accepting attribute updates
>> 06:59:20 [14935] node03 attrd: notice: main: Starting mainloop...
>> 06:59:20 [14933] node03 stonith-ng: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
>> 06:59:20 [14933] node03 stonith-ng: info: get_cluster_type: Cluster type is: 'openais'
>> 06:59:20 [14933] node03 stonith-ng: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
>> 06:59:20 [14933] node03 stonith-ng: info: init_ais_connection_classic: Creating connection to our Corosync plugin
>> 06:59:20 [14932] node03 cib: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>> 06:59:20 [14932] node03 cib: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk] <cib epoch="251" num_updates="0" admin_epoch="1" validate-with="pacemaker-1.2" crm_feature_set="3.0.6" update-origin="node03" update-client="crmd" cib-last-written="Tue Apr  9 06:48:33 2013" have-quorum="1" >
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]   <configuration >
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]     <crm_config >
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]       <cluster_property_set id="cib-bootstrap-options" >
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]         <nvpair id="cib-bootstrap-options-default-resource-stickiness" name="default-resource-stickiness" value="1000" />
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]         <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore" />
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]         <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false" />
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]         <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="3" />
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" />
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]         <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="openais" />
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]         <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1365160119" />
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]       </cluster_property_set>
>> 06:59:20 [14932] node03 cib: debug: readCibXmlFile: [on-disk]     </crm_config>
>> …
>> …
>> ...
>>
>> We are still seeing the extra pacemaker daemons when corosync starts up.
>> As an added check, all pacemaker daemons exited correctly when stopping
>> corosync. lrmd attempts to start twice..
>>
>> ps aux | grep lrmd
>> root     16412  0.0  0.0      0     0 ?        Z    07:20   0:00 [lrmd] <defunct>
>> root     16419  0.0  0.0  34240  1052 ?        S    07:20   0:00 /usr/lib64/heartbeat/lrmd
>> root     21030  0.0  0.0 103244   856 pts/0    S+   08:37   0:00 grep lrmd
>>
>> Help to resolve this issue appreciated..
>>
>> Cheers,
>> Jimmy.
>>
>> On 9 Apr 2013, at 00:16, Andrew Beekhof <[email protected]> wrote:
>>
>>> On 08/04/2013, at 9:44 PM, Jimmy Magee <[email protected]> wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>> thanks for your reply, we are running at debug level with the following
>>>> config from corosync.conf
>>>>
>>>> logging {
>>>>     fileline: off
>>>>     to_syslog: yes
>>>>     to_stderr: no
>>>>     syslog_facility: daemon
>>>>     debug: on
>>>>     timestamp: on
>>>> }
>>>>
>>>> Looking at the issue further, there seem to be 2 instances of some
>>>> pacemaker daemons running on this particular node….
>>>>
>>>> ps aux | grep pace
>>>>
>>>> 495       3050  0.2  0.0  89956  7184 ?        S    07:10   0:01 /usr/libexec/pacemaker/cib
>>>> root      3051  0.0  0.0  87128  3152 ?        S    07:10   0:00 /usr/libexec/pacemaker/stonithd
>>>> 495       3053  0.0  0.0  91188  2840 ?        S    07:10   0:00 /usr/libexec/pacemaker/attrd
>>>> 495       3054  0.0  0.0  87336  2484 ?        S    07:10   0:00 /usr/libexec/pacemaker/pengine
>>>> 495       3055  0.0  0.0  91332  3156 ?        S    07:10   0:00 /usr/libexec/pacemaker/crmd
>>>> 495       3057  0.0  0.0  88876  5224 ?        S    07:10   0:00 /usr/libexec/pacemaker/cib
>>>> root      3058  0.0  0.0  87128  3132 ?        S    07:10   0:00 /usr/libexec/pacemaker/stonithd
>>>> 495       3060  0.0  0.0  91188  2788 ?        S    07:10   0:00 /usr/libexec/pacemaker/attrd
>>>> 495       3062  0.0  0.0  91436  3932 ?        S    07:10   0:00 /usr/libexec/pacemaker/crmd
>>>>
>>>> ps aux | grep corosync
>>>> root      3044  0.1  0.0 977852  9264 ?        Ssl  07:10   0:01 corosync
>>>> root      9363  0.0  0.0 103248   856 pts/0    S+   07:33   0:00 grep corosync
>>>>
>>>> ps aux | grep lrmd
>>>> root      3052  0.0  0.0  76464  2528 ?        S    07:10   0:00 /usr/lib64/heartbeat/lrmd
>>>>
>>>> Not sure why this is the case? Appreciate any help..
>>>>
>>> Have you perhaps specified "ver: 0" for the pacemaker plugin and run
>>> "service pacemaker start" ?
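>>>
>>> A quick way to tell which of the two start-up models is actually in
>>> effect (just a sketch, assuming the stock el6 packages):
>>>
>>>     # what does the plugin stanza say?
>>>     grep -A3 'service {' /etc/corosync/corosync.conf
>>>
>>>     # is the pacemaker init script enabled as well?
>>>     chkconfig --list pacemaker
>>>
>>>     # pacemakerd is only present when the init script started the daemons
>>>     ps -ef | grep pacemaker[d]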
>>>> Cheers,
>>>> Jimmy.
>>>>
>>>> On 8 Apr 2013, at 03:00, Andrew Beekhof <[email protected]> wrote:
>>>>
>>>>> This doesn't look promising:
>>>>>
>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 15
>>>>> lrmd: [4946]: info: Signal sent to pid=4939, waiting for process to exit
>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 17
>>>>> lrmd: [4939]: info: enabling coredumps
>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 10
>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 12
>>>>> lrmd: [4939]: info: Started.
>>>>> lrmd: [4939]: info: lrmd is shutting down
>>>>>
>>>>> The lrmd comes up but then immediately shuts down.
>>>>> Perhaps try enabling debug to see if that sheds any light.
>>>>>
>>>>> On 06/04/2013, at 4:58 AM, Jimmy Magee <[email protected]> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> Apologies for reposting this query, it inadvertently got added to an
>>>>>> existing topic!
>>>>>>
>>>>>> We have a three node cluster deployed in a customer's network:
>>>>>> - 2 nodes are on the same switch
>>>>>> - the 3rd node is on the same subnet but there's a router in between.
>>>>>> - IP Multicast is enabled and has been tested using omping as follows..
>>>>>>
>>>>>> On each node we ran..
>>>>>>
>>>>>> omping node01 node02 node03
>>>>>>
>>>>>> On node 3
>>>>>>
>>>>>> Node01 : unicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev = 0.128/0.181/0.255/0.025
>>>>>> Node01 : multicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev = 0.140/0.187/0.219/0.021
>>>>>> Node02 : unicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev = 0.115/0.150/0.168/0.021
>>>>>> Node02 : multicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev = 0.134/0.162/0.177/0.014
>>>>>>
>>>>>> On node 2
>>>>>>
>>>>>> Node01 : unicast, xmt/rcv/%loss = 9/9/0%, min/avg/max/std-dev = 0.168/0.191/0.205/0.014
>>>>>> Node01 : multicast, xmt/rcv/%loss = 9/8/11% (seq>=2 0%), min/avg/max/std-dev = 0.138/0.179/0.206/0.028
>>>>>> Node03 : unicast, xmt/rcv/%loss = 9/9/0%, min/avg/max/std-dev = 0.112/0.149/0.175/0.022
>>>>>> Node03 : multicast, xmt/rcv/%loss = 9/8/11% (seq>=2 0%), min/avg/max/std-dev = 0.124/0.167/0.178/0.018
>>>>>>
>>>>>> On node 1
>>>>>>
>>>>>> Node02 : unicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev = 0.154/0.185/0.208/0.019
>>>>>> Node02 : multicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev = 0.175/0.198/0.214/0.015
>>>>>> Node03 : unicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev = 0.114/0.160/0.185/0.019
>>>>>> Node03 : multicast, xmt/rcv/%loss = 23/22/4% (seq>=2 0%), min/avg/max/std-dev = 0.124/0.172/0.197/0.019
>>>>>>
>>>>>> - The problem is intermittent but frequent; occasionally everything
>>>>>>   starts fine from scratch.
>>>>>>
>>>>>> We suspect the problem is related to node 3 as we can see lrmd failures
>>>>>> as per the attached log. We've checked that permissions are ok as per
>>>>>> https://bugs.launchpad.net/ubuntu/+source/cluster-glue/+bug/676391
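>>>>>>
>>>>>> e.g. (illustrative only -- these are the "cores" directories the
>>>>>> daemons chdir into per the logs below):
>>>>>>
>>>>>> ls -ld /var/lib/heartbeat /var/lib/heartbeat/cores /var/lib/heartbeat/cores/root /var/lib/heartbeat/cores/hacluster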
>>>>>>
>>>>>> stonith-ng[1437]: error: ais_dispatch: AIS connection failed
>>>>>> stonith-ng[1437]: error: stonith_peer_ais_destroy: AIS connection terminated
>>>>>> corosync[1430]: [SERV ] Service engine unloaded: Pacemaker Cluster Manager 1.1.6
>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync extended virtual synchrony service
>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync configuration service
>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync cluster config database access v1.01
>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync profile loading service
>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
>>>>>> corosync[1430]: [MAIN ] Corosync Cluster Engine exiting with status 0 at main.c:1894.
>>>>>>
>>>>>> corosync[4931]: [MAIN ] Corosync built-in features: nss dbus rdma snmp
>>>>>> corosync[4931]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
>>>>>> corosync[4931]: [TOTEM ] Initializing transport (UDP/IP Multicast).
>>>>>> corosync[4931]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
>>>>>> corosync[4931]: [TOTEM ] The network interface [10.87.79.59] is now up.
>>>>>> corosync[4931]: [pcmk ] Logging: Initialized pcmk_startup
>>>>>> corosync[4931]: [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.1.6
>>>>>> corosync[4931]: [pcmk ] Logging: Initialized pcmk_startup
>>>>>> corosync[4931]: [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.1.6
>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync configuration service
>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync profile loading service
>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
>>>>>> corosync[4931]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
>>>>>> corosync[4931]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>>> corosync[4931]: [CPG ] chosen downlist: sender r(0) ip(10.87.79.59) ; members(old:0 left:0)
>>>>>> corosync[4931]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>>> cib[4937]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>>> cib[4937]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
>>>>>> cib[4937]: info: validate_with_relaxng: Creating RNG parser context
>>>>>> stonith-ng[4945]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
>>>>>> stonith-ng[4945]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>> stonith-ng[4945]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
>>>>>> stonith-ng[4945]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
>>>>>> cib[4944]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>>> cib[4944]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
>>>>>> stonith-ng[4945]: info: init_ais_connection_classic: AIS connection established
>>>>>> stonith-ng[4945]: info: get_ais_nodeid: Server details: id=1003428268 uname=node03 cname=pcmk
>>>>>> stonith-ng[4945]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
>>>>>> stonith-ng[4945]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>> stonith-ng[4945]: info: crm_new_peer: Node 1003428268 is now known as node03
>>>>>> cib[4944]: info: validate_with_relaxng: Creating RNG parser context
>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 15
>>>>>> lrmd: [4946]: info: Signal sent to pid=4939, waiting for process to exit
>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 17
>>>>>> lrmd: [4939]: info: enabling coredumps
>>>>>> stonith-ng[4938]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 10
>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for signal 12
>>>>>> lrmd: [4939]: info: Started.
>>>>>> stonith-ng[4938]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>> lrmd: [4939]: info: lrmd is shutting down
>>>>>> stonith-ng[4938]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
>>>>>> stonith-ng[4938]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
>>>>>> attrd[4940]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>>> pengine[4941]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>>> attrd[4940]: info: main: Starting up
>>>>>> attrd[4940]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>> attrd[4940]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
>>>>>> attrd[4940]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
>>>>>> crmd[4942]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>>> pengine[4941]: info: main: Starting pengine
>>>>>> crmd[4942]: notice: main: CRM Hg Version: 148fccfd5985c5590cc601123c6c16e966b85d14
>>>>>> pengine[4948]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>>> pengine[4948]: warning: main: Terminating previous PE instance
>>>>>> attrd[4947]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>>> pengine[4941]: warning: process_pe_message: Received quit message, terminating
>>>>>> attrd[4947]: info: main: Starting up
>>>>>> attrd[4947]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>> attrd[4947]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
>>>>>> attrd[4947]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
>>>>>> crmd[4949]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>>> crmd[4949]: notice: main: CRM Hg Version: 148fccfd5985c5590cc601123c6c16e966b85d14
>>>>>> stonith-ng[4938]: info: init_ais_connection_classic: AIS connection established
>>>>>> stonith-ng[4938]: info: get_ais_nodeid: Server details: id=1003428268 uname=node03 cname=pcmk
>>>>>> stonith-ng[4938]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
>>>>>> stonith-ng[4938]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>> stonith-ng[4938]: info: crm_new_peer: Node 1003428268 is now known as node03
>>>>>> attrd[4940]: info: init_ais_connection_classic: AIS connection established
>>>>>> attrd[4940]: info: get_ais_nodeid: Server details: id=1003428268 uname=node03 cname=pcmk
>>>>>> attrd[4940]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
>>>>>> attrd[4940]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>> attrd[4940]: info: crm_new_peer: Node 1003428268 is now known as node03
>>>>>> attrd[4940]: info: main: Cluster connection active
>>>>>> attrd[4940]: info: main: Accepting attribute updates
>>>>>> attrd[4940]: notice: main: Starting mainloop...
>>>>>> attrd[4947]: info: init_ais_connection_classic: AIS connection established
>>>>>> attrd[4947]: info: get_ais_nodeid: Server details: id=1003428268 uname=node03 cname=pcmk
>>>>>> attrd[4947]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
>>>>>> attrd[4947]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>> attrd[4947]: info: crm_new_peer: Node 1003428268 is now known as node03
>>>>>> attrd[4947]: info: main: Cluster connection active
>>>>>> attrd[4947]: info: main: Accepting attribute updates
>>>>>> attrd[4947]: notice: main: Starting mainloop...
>>>>>> cib[4937]: info: startCib: CIB Initialization completed successfully
>>>>>> cib[4937]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>> cib[4937]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
>>>>>> cib[4937]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
>>>>>> cib[4944]: info: startCib: CIB Initialization completed successfully
>>>>>> cib[4944]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>> cib[4944]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
>>>>>> cib[4944]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
>>>>>> cib[4937]: info: init_ais_connection_classic: AIS connection established
>>>>>> cib[4937]: info: get_ais_nodeid: Server details: id=1003428268 uname=node03 cname=pcmk
>>>>>> cib[4937]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
>>>>>> cib[4937]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>> cib[4937]: info: crm_new_peer: Node 1003428268 is now known as node03
>>>>>> cib[4937]: info: cib_init: Starting cib mainloop
>>>>>> cib[4937]: info: ais_dispatch_message: Membership 6892: quorum still lost
>>>>>> cib[4937]: info: crm_update_peer: Node node03: id=1003428268 state=member (new) addr=r(0) ip(10.87.79.59) (new) votes=1 (new) born=0 seen=6892 proc=00000000000000000000000000111312 (new)
>>>>>> cib[4944]: info: init_ais_connection_classic: AIS connection established
>>>>>> cib[4944]: info: get_ais_nodeid: Server details: id=1003428268 uname=node03 cname=pcmk
>>>>>> cib[4944]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
>>>>>> cib[4944]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>> cib[4944]: info: crm_new_peer: Node 1003428268 is now known as node03
>>>>>> cib[4944]: info: cib_init: Starting cib mainloop
>>>>>> stonith-ng[4945]: notice: setup_cib: Watching for stonith topology changes
>>>>>> stonith-ng[4945]: info: main: Starting stonith-ng mainloop
>>>>>> cib[4937]: info: ais_dispatch_message: Membership 6896: quorum still lost
>>>>>> corosync[4931]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>>> cib[4937]: info: crm_new_peer: Node <null> now has id: 969873836
>>>>>> cib[4937]: info: crm_update_peer: Node (null): id=969873836 state=member (new) addr=r(0) ip(172.25.207.57) votes=0 born=0 seen=6896 proc=00000000000000000000000000000000
>>>>>> cib[4937]: info: crm_new_peer: Node <null> now has id: 986651052
>>>>>> cib[4937]: info: crm_update_peer: Node (null): id=986651052 state=member (new) addr=r(0) ip(172.25.207.58) votes=0 born=0 seen=6896 proc=00000000000000000000000000000000
>>>>>> cib[4937]: notice: ais_dispatch_message: Membership 6896: quorum acquired
>>>>>> cib[4937]: info: crm_get_peer: Node 986651052 is now known as node02
>>>>>> cib[4937]: info: crm_update_peer: Node node02: id=986651052 state=member addr=r(0) ip(172.25.207.58) votes=1 (new) born=6812 seen=6896 proc=00000000000000000000000000111312 (new)
>>>>>> cib[4937]: info: ais_dispatch_message: Membership 6896: quorum retained
>>>>>> cib[4937]: info: crm_get_peer: Node 969873836 is now known as node01
>>>>>> cib[4937]: info: crm_update_peer: Node node01: id=969873836 state=member addr=r(0) ip(172.25.207.57) votes=1 (new) born=6848 seen=6896 proc=00000000000000000000000000111312 (new)
>>>>>> rsyslogd-2177: imuxsock begins to drop messages from pid 4931 due to rate-limiting
>>>>>> crmd[4942]: info: do_cib_control: CIB connection established
>>>>>> crmd[4942]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>> crmd[4942]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
>>>>>> crmd[4942]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
>>>>>> cib[4937]: info: cib_process_diff: Diff 1.249.28 -> 1.249.29 not applied to 1.249.0: current "num_updates" is less than required
>>>>>> cib[4937]: info: cib_server_process_diff: Requesting re-sync from peer
>>>>>> crmd[4949]: info: do_cib_control: CIB connection established
>>>>>> crmd[4949]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>> crmd[4949]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
>>>>>> crmd[4949]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
>>>>>> stonith-ng[4938]: notice: setup_cib: Watching for stonith topology changes
>>>>>> stonith-ng[4938]: info: main: Starting stonith-ng mainloop
>>>>>> cib[4937]: notice: cib_server_process_diff: Not applying diff 1.249.29 -> 1.249.30 (sync in progress)
>>>>>> crmd[4942]: info: init_ais_connection_classic: AIS connection established
>>>>>> crmd[4942]: info: get_ais_nodeid: Server details: id=1003428268 uname=node03 cname=pcmk
>>>>>> crmd[4942]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
>>>>>> crmd[4942]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>> crmd[4942]: info: crm_new_peer: Node 1003428268 is now known as node03
>>>>>> crmd[4942]: info: ais_status_callback: status: node03 is now unknown
>>>>>> crmd[4942]: info: do_ha_control: Connected to the cluster
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 1 (30 max) times
>>>>>> crmd[4949]: info: init_ais_connection_classic: AIS connection established
>>>>>> crmd[4949]: info: get_ais_nodeid: Server details: id=1003428268 uname=node03 cname=pcmk
>>>>>> crmd[4949]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
>>>>>> crmd[4942]: notice: ais_dispatch_message: Membership 6896: quorum acquired
>>>>>> crmd[4949]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>> crmd[4949]: info: crm_new_peer: Node 1003428268 is now known as node03
>>>>>> crmd[4942]: info: crm_new_peer: Node node01 now has id: 969873836
>>>>>> crmd[4949]: info: ais_status_callback: status: node03 is now unknown
>>>>>> crmd[4942]: info: crm_new_peer: Node 969873836 is now known as node01
>>>>>> crmd[4949]: info: do_ha_control: Connected to the cluster
>>>>>> crmd[4942]: info: ais_status_callback: status: node01 is now unknown
>>>>>> crmd[4942]: info: ais_status_callback: status: node01 is now member (was unknown)
>>>>>> crmd[4942]: info: crm_update_peer: Node node01: id=969873836 state=member (new) addr=r(0) ip(172.25.207.57) votes=1 born=6848 seen=6896 proc=00000000000000000000000000111312
>>>>>> crmd[4942]: info: crm_new_peer: Node node02 now has id: 986651052
>>>>>> crmd[4942]: info: crm_new_peer: Node 986651052 is now known as node02
>>>>>> crmd[4942]: info: ais_status_callback: status: node02 is now unknown
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 1 (30 max) times
>>>>>> crmd[4942]: info: ais_status_callback: status: node02 is now member (was unknown)
>>>>>> crmd[4942]: info: crm_update_peer: Node node02: id=986651052 state=member (new) addr=r(0) ip(172.25.207.58) votes=1 born=6812 seen=6896 proc=00000000000000000000000000111312
>>>>>> crmd[4942]: notice: crmd_peer_update: Status update: Client node03/crmd now has status [online] (DC=<null>)
>>>>>> crmd[4942]: info: ais_status_callback: status: node03 is now member (was unknown)
>>>>>> crmd[4942]: info: crm_update_peer: Node node03: id=1003428268 state=member (new) addr=r(0) ip(10.87.79.59) (new) votes=1 (new) born=6896 seen=6896 proc=00000000000000000000000000111312 (new)
>>>>>> crmd[4942]: info: ais_dispatch_message: Membership 6896: quorum retained
>>>>>> cib[4937]: notice: cib_server_process_diff: Not applying diff 1.249.30 -> 1.249.31 (sync in progress)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 2 (30 max) times
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 3 (30 max) times
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 2 (30 max) times
>>>>>> crmd[4949]: notice: ais_dispatch_message: Membership 6896: quorum acquired
>>>>>> rsyslogd-2177: imuxsock begins to drop messages from pid 4937 due to rate-limiting
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 4 (30 max) times
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 5 (30 max) times
>>>>>> pengine[4948]: info: main: Starting pengine
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 6 (30 max) times
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 3 (30 max) times
>>>>>> attrd[4940]: info: cib_connect: Connected to the CIB after 1 signon attempts
>>>>>> attrd[4940]: info: cib_connect: Sending full refresh
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 7 (30 max) times
>>>>>> attrd[4947]: info: cib_connect: Connected to the CIB after 1 signon attempts
>>>>>> attrd[4947]: info: cib_connect: Sending full refresh
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 4 (30 max) times
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 8 (30 max) times
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 5 (30 max) times
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 9 (30 max) times
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 6 (30 max) times
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 10 (30 max) times
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 7 (30 max) times
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 11 (30 max) times
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 8 (30 max) times
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 12 (30 max) times
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 9 (30 max) times
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 13 (30 max) times
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 10 (30 max) times
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 14 (30 max) times
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 11 (30 max) times
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 12 (30 max) times
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 15 (30 max) times
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 13 (30 max) times
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 16 (30 max) times
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 14 (30 max) times
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 17 (30 max) times
>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 15 (30 max) times
>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 18 (30 max) times
>>>>>>
>>>>>> We have the following components installed..
>>>>>>
>>>>>> corosynclib-1.4.1-15.el6.x86_64
>>>>>> corosync-1.4.1-15.el6.x86_64
>>>>>> cluster-glue-libs-1.0.5-6.el6.x86_64
>>>>>> clusterlib-3.0.12.1-49.el6.x86_64
>>>>>> pacemaker-cluster-libs-1.1.7-6.el6.x86_64
>>>>>> cluster-glue-1.0.5-6.el6.x86_64
>>>>>> resource-agents-3.9.2-12.el6.x86_64
>>>>>>
>>>>>> We'd appreciate assistance on how to debug what the issue may be and
>>>>>> some possible causes.
>>>>>>
>>>>>> Cheers,
>>>>>> Jimmy

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
