Hi Andrew,
Thanks for your reply, we tried that option but to no avail.
To resolve the issue, what worked for us was to remove the existing HA packages
and update pacemaker to 1.1.8-7.
Here is the procedure…
1. Back up /etc/corosync/corosync.conf and /etc/corosync/authkey.
2. Export cib.xml:
cibadmin -Q > /tmp/ha_backup/cib.xml
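Steps 1 and 2 can be wrapped in a small helper so nothing gets missed before the
removal step (a sketch only; the helper name is ours, the paths and the
/tmp/ha_backup destination are the ones used above):

```shell
# Copy the corosync config and authkey into one backup directory.
# Parameterised so it can be exercised outside a cluster node.
backup_ha_config() {
    src=$1   # directory holding corosync.conf and authkey
    dst=$2   # backup destination, created if missing
    mkdir -p "$dst"
    cp "$src/corosync.conf" "$src/authkey" "$dst/"
}

# On a live node this would be:
#   backup_ha_config /etc/corosync /tmp/ha_backup
#   cibadmin -Q > /tmp/ha_backup/cib.xml
```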
3. Stop corosync services on all nodes
4. Remove existing HA packages:
yum -y remove pacemaker corosync heartbeat resource-agents \
    cluster-glue rgmanager lvm2-cluster gfs2-utils
5. Install updated HA packages:
yum -y install pacemaker cman ccs resource-agents
resulting in the following packages being installed:
pacemaker-doc-1.1.8-7.el6.x86_64
pacemaker-cli-1.1.8-7.el6.x86_64
pacemaker-libs-1.1.8-7.el6.x86_64
pacemaker-cts-1.1.8-7.el6.x86_64
pacemaker-libs-devel-1.1.8-7.el6.x86_64
pacemaker-cluster-libs-1.1.8-7.el6.x86_64
pacemaker-1.1.8-7.el6.x86_64
pacemaker-debuginfo-1.1.8-7.el6.x86_64
cman-3.0.12.1-49.el6.x86_64
ccs-0.16.2-55.el6.x86_64
resource-agents-3.9.2-12.el6.x86_64
cluster-glue-libs-1.0.5-6.el6.x86_64
corosync-1.4.1-15.el6.x86_64
corosynclib-1.4.1-15.el6.x86_64
corosync-debuginfo-1.4.1-15.el6.x86_64
corosynclib-devel-1.4.1-15.el6.x86_64
6. Get the crmsh package and install it:
yum -y install crmsh*
7. Start the ricci service:
service ricci start
Also ensure it starts on boot:
chkconfig --add ricci
8. Set the ricci password:
passwd ricci
9. Configure the cluster:
ccs -f /etc/cluster/cluster.conf --createcluster testprod -i
ccs -f /etc/cluster/cluster.conf --addnode node01
ccs -f /etc/cluster/cluster.conf --addnode node02
ccs -f /etc/cluster/cluster.conf --addnode node03
ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node01
ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node02
ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node03
ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node01 pcmk-redirect port=1
ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node02 pcmk-redirect port=2
ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node03 pcmk-redirect port=3
ccs -f /etc/cluster/cluster.conf --setlogging debug=on
ccs -f /etc/cluster/cluster.conf --settotem
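For reference, the ccs commands above should produce a cluster.conf along these
lines (a sketch reconstructed from the commands, not a verbatim copy;
config_version, ids and attribute order will differ on a real system):

```xml
<cluster name="testprod" config_version="1">
  <clusternodes>
    <clusternode name="node01" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="1"/>
        </method>
      </fence>
    </clusternode>
    <!-- node02 and node03 follow the same pattern with ports 2 and 3 -->
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_pcmk" name="pcmk"/>
  </fencedevices>
  <logging debug="on"/>
</cluster>
```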
10. Distribute the cluster.conf:
ccs -h node01 -p ************** --sync --activate
11. Set the CMAN quorum timeout to 0 on each of the three nodes:
echo "CMAN_QUORUM_TIMEOUT=0" >> /etc/sysconfig/cman
12. Start the services on each node:
service cman start
service pacemaker start
Also ensure they start on boot:
chkconfig --add cman
chkconfig --add pacemaker
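The boot-time registration in step 12 can be reviewed as a dry run first (a
sketch; the helper only prints the chkconfig commands so they can be checked,
or piped to sh to apply — note `chkconfig <svc> on` is an extra step beyond
`--add` that forces the services on in the default runlevels):

```shell
# Print the chkconfig commands for each service instead of running
# them, so the boot-time setup can be reviewed (or piped to sh).
enable_on_boot() {
    for svc in "$@"; do
        echo "chkconfig --add $svc"
        echo "chkconfig $svc on"
    done
}

enable_on_boot cman pacemaker
```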
Best of luck,
Jimmy.
On 12 Apr 2013, at 02:11, Andrew Beekhof <[email protected]> wrote:
>
> On 11/04/2013, at 6:05 AM, Jimmy Magee <[email protected]> wrote:
>
>> Hi,
>>
>> Following up on the above thread, any thoughts as to what may be causing the
>> issue..
>
> One of the main reasons pacemakerd was created was to avoid weirdness around
> the starting of pacemaker's child processes from within a multi-threaded
> application like corosync... which is almost certainly what you're bumping
> into here.
>
> Could you try using "ver: 1" in corosync.conf and "service pacemaker start"
> to rule out any other causes?
>
>>
>> Cheers,
>> Jimmy.
>>
>>
>>
>> On 9 Apr 2013, at 13:39, Jimmy Magee <[email protected]> wrote:
>>
>>> Hi Andrew,
>>>
>>> The corosync.conf is configured as follows:
>>>
>>>
>>>> service {
>>>> # Load the Pacemaker Cluster Resource Manager
>>>> name: pacemaker
>>>> ver: 0
>>>> }
>>>
>>>
>>>
>>> and pacemaker is not started via service pacemaker start…
>>>
>>> here is the extract from the logs with extra debug when attempting to start
>>> corosync/pacemaker..
>>>
>>> 06:59:20 corosync [MAIN ] Corosync Cluster Engine ('1.4.1'): started and
>>> ready to provide service.
>>> 06:59:20 corosync [MAIN ] Corosync built-in features: nss dbus rdma snmp
>>> 06:59:20 corosync [MAIN ] Successfully read main configuration file
>>> '/etc/corosync/corosync.conf'.
>>> 06:59:20 corosync [TOTEM ] waiting_trans_ack changed to 1
>>> 06:59:20 corosync [TOTEM ] Token Timeout (5000 ms) retransmit timeout (247
>>> ms)
>>> 06:59:20 corosync [TOTEM ] token hold (187 ms) retransmits before loss (20
>>> retrans)
>>> 06:59:20 corosync [TOTEM ] join (1000 ms) send_join (0 ms) consensus (7500
>>> ms) merge (200 ms)
>>> 06:59:20 corosync [TOTEM ] downcheck (1000 ms) fail to recv const (2500
>>> msgs)
>>> 06:59:20 corosync [TOTEM ] seqno unchanged const (30 rotations) Maximum
>>> network MTU 1402
>>> 06:59:20 corosync [TOTEM ] window size per rotation (50 messages) maximum
>>> messages per rotation (20 messages)
>>> 06:59:20 corosync [TOTEM ] missed count const (5 messages)
>>> 06:59:20 corosync [TOTEM ] send threads (0 threads)
>>> 06:59:20 corosync [TOTEM ] RRP token expired timeout (247 ms)
>>> 06:59:20 corosync [TOTEM ] RRP token problem counter (2000 ms)
>>> 06:59:20 corosync [TOTEM ] RRP threshold (10 problem count)
>>> 06:59:20 corosync [TOTEM ] RRP multicast threshold (100 problem count)
>>> 06:59:20 corosync [TOTEM ] RRP automatic recovery check timeout (1000 ms)
>>> 06:59:20 corosync [TOTEM ] RRP mode set to none.
>>> 06:59:20 corosync [TOTEM ] heartbeat_failures_allowed (0)
>>> 06:59:20 corosync [TOTEM ] max_network_delay (50 ms)
>>> 06:59:20 corosync [TOTEM ] HeartBeat is Disabled. To enable set
>>> heartbeat_failures_allowed > 0
>>> 06:59:20 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
>>> 06:59:20 corosync [TOTEM ] Initializing transmit/receive security:
>>> libtomcrypt SOBER128/SHA1HMAC (mode 0).
>>> 06:59:20 corosync [IPC ] you are using ipc api v2
>>> 06:59:20 corosync [TOTEM ] Receive multicast socket recv buffer size
>>> (320000 bytes).
>>> 06:59:20 corosync [TOTEM ] Transmit multicast socket send buffer size
>>> (320000 bytes).
>>> 06:59:20 corosync [TOTEM ] Local receive multicast loop socket recv buffer
>>> size (320000 bytes).
>>> 06:59:20 corosync [TOTEM ] Local transmit multicast loop socket send buffer
>>> size (320000 bytes).
>>> 06:59:20 corosync [TOTEM ] The network interface [10.87.79.59] is now up.
>>> 06:59:20 corosync [TOTEM ] Created or loaded sequence id 6984.10.87.79.59
>>> for this ring.
>>> Set r/w permissions for uid=0, gid=0 on /var/log/corosync.log
>>> 06:59:20 corosync [pcmk ] Logging: Initialized pcmk_startup
>>> Set r/w permissions for uid=0, gid=0 on /var/log/corosync.log
>>> 06:59:20 corosync [SERV ] Service engine loaded: Pacemaker Cluster Manager
>>> 1.1.6
>>> 06:59:20 corosync [pcmk ] Logging: Initialized pcmk_startup
>>> 06:59:20 corosync [SERV ] Service engine loaded: Pacemaker Cluster Manager
>>> 1.1.6
>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync extended virtual
>>> synchrony service
>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync configuration
>>> service
>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync cluster closed
>>> process group service v1.01
>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync cluster config
>>> database access v1.01
>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync profile loading
>>> service
>>> 06:59:20 corosync [SERV ] Service engine loaded: corosync cluster quorum
>>> service v0.1
>>> 06:59:20 corosync [MAIN ] Compatibility mode set to whitetank. Using V1
>>> and V2 of the synchronization engine.
>>> 06:59:20 corosync [TOTEM ] entering GATHER state from 15.
>>> 06:59:20 corosync [TOTEM ] Creating commit token because I am the rep.
>>> 06:59:20 corosync [TOTEM ] Saving state aru 0 high seq received 0
>>> 06:59:20 corosync [TOTEM ] Storing new sequence id for ring 1b4c
>>> 06:59:20 corosync [TOTEM ] entering COMMIT state.
>>> 06:59:20 corosync [TOTEM ] got commit token
>>> 06:59:20 corosync [TOTEM ] entering RECOVERY state.
>>> 06:59:20 corosync [TOTEM ] position [0] member 10.87.79.59:
>>> 06:59:20 corosync [TOTEM ] previous ring seq 6984 rep 10.87.79.59
>>> 06:59:20 corosync [TOTEM ] aru 0 high delivered 0 received flag 1
>>> 06:59:20 corosync [TOTEM ] Did not need to originate any messages in
>>> recovery.
>>> 06:59:20 corosync [TOTEM ] got commit token
>>> 06:59:20 corosync [TOTEM ] Sending initial ORF token
>>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0
>>> retrans queue empty 1 count 0, aru 0
>>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0
>>> retrans queue empty 1 count 1, aru 0
>>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0
>>> retrans queue empty 1 count 2, aru 0
>>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>>> 06:59:20 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0
>>> retrans queue empty 1 count 3, aru 0
>>> 06:59:20 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>>> 06:59:20 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0
>>> aru 0 0
>>> 06:59:20 corosync [TOTEM ] Resetting old ring state
>>> 06:59:20 corosync [TOTEM ] recovery to regular 1-0
>>> 06:59:20 corosync [TOTEM ] waiting_trans_ack changed to 1
>>> 06:59:20 corosync [SYNC ] This node is within the primary component and
>>> will provide service.
>>> 06:59:20 corosync [TOTEM ] entering OPERATIONAL state.
>>> 06:59:20 corosync [TOTEM ] A processor joined or left the membership and a
>>> new membership was formed.
>>> 06:59:20 corosync [SYNC ] confchg entries 1
>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268
>>> = 1.
>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy CLM
>>> service)
>>> 06:59:20 corosync [SYNC ] confchg entries 1
>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268
>>> = 1.
>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy CLM
>>> service)
>>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy AMF
>>> service)
>>> 06:59:20 corosync [SYNC ] confchg entries 1
>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268
>>> = 1.
>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy AMF
>>> service)
>>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy CKPT
>>> service)
>>> 06:59:20 corosync [SYNC ] confchg entries 1
>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268
>>> = 1.
>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy CKPT
>>> service)
>>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (dummy EVT
>>> service)
>>> 06:59:20 corosync [SYNC ] confchg entries 1
>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268
>>> = 1.
>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>>> 06:59:20 corosync [SYNC ] Committing synchronization for (dummy EVT
>>> service)
>>> 06:59:20 corosync [SYNC ] Synchronization actions starting for (corosync
>>> cluster closed process group service v1.01)
>>> 06:59:20 corosync [CPG ] comparing: sender r(0) ip(10.87.79.59) ;
>>> members(old:0 left:0)
>>> 06:59:20 corosync [CPG ] chosen downlist: sender r(0) ip(10.87.79.59) ;
>>> members(old:0 left:0)
>>> 06:59:20 corosync [SYNC ] confchg entries 1
>>> 06:59:20 corosync [SYNC ] Barrier Start Received From 1003428268
>>> 06:59:20 corosync [SYNC ] Barrier completion status for nodeid 1003428268
>>> = 1.
>>> 06:59:20 corosync [SYNC ] Synchronization barrier completed
>>> 06:59:20 corosync [SYNC ] Committing synchronization for (corosync cluster
>>> closed process group service v1.01)
>>> 06:59:20 corosync [MAIN ] Completed service synchronization, ready to
>>> provide service.
>>> 06:59:20 corosync [TOTEM ] waiting_trans_ack changed to 0
>>> 06:59:20 node03 lrmd: [14934]: info: G_main_add_SignalHandler: Added signal
>>> handler for signal 15
>>> 06:59:20 node03 lrmd: [14934]: info: G_main_add_SignalHandler: Added signal
>>> handler for signal 17
>>> 06:59:20 node03 lrmd: [14934]: info: enabling coredumps
>>> 06:59:20 node03 lrmd: [14934]: info: G_main_add_SignalHandler: Added signal
>>> handler for signal 10
>>> 06:59:20 node03 lrmd: [14934]: info: G_main_add_SignalHandler: Added signal
>>> handler for signal 12
>>> 06:59:20 node03 lrmd: [14934]: debug: main: run the loop...
>>> 06:59:20 node03 lrmd: [14934]: info: Started.
>>> 06:59:20 [14935]node03 attrd: info: crm_log_init_worker: Changed
>>> active directory to /var/lib/heartbeat/cores/hacluster
>>> 06:59:20 [14935]node03 attrd: info: main: Starting up
>>> 06:59:20 [14935]node03 attrd: info: get_cluster_type: Cluster
>>> type is: 'openais'
>>> 06:59:20 [14935]node03 attrd: notice: crm_cluster_connect:
>>> Connecting to cluster infrastructure: classic openais (with plugin)
>>> 06:59:20 [14936]node03 pengine: info: crm_log_init_worker: Changed
>>> active directory to /var/lib/heartbeat/cores/hacluster
>>> 06:59:20 [14935]node03 attrd: info: init_ais_connection_classic:
>>> Creating connection to our Corosync plugin
>>> 06:59:20 [14936]node03 pengine: debug: main: Checking for old
>>> instances of pengine
>>> 06:59:20 [14937]node03 crmd: info: crm_log_init_worker: Changed
>>> active directory to /var/lib/heartbeat/cores/hacluster
>>> 06:59:20 [14936]node03 pengine: debug:
>>> init_client_ipc_comms_nodispatch: Attempting to talk on:
>>> /var/run/crm/pengine
>>> 06:59:20 [14937]node03 crmd: notice: main: CRM Hg Version:
>>> 148fccfd5985c5590cc601123c6c16e966b85d14
>>> 06:59:20 [14936]node03 pengine: debug:
>>> init_client_ipc_comms_nodispatch: Could not init comms on:
>>> /var/run/crm/pengine
>>> 06:59:20 [14936]node03 pengine: debug: main: Init server comms
>>> 06:59:20 [14936]node03 pengine: info: main: Starting pengine
>>> 06:59:20 [14937]node03 crmd: debug: crmd_init: Starting crmd
>>> 06:59:20 [14937]node03 crmd: debug: s_crmd_fsa: Processing
>>> I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
>>> 06:59:20 [14937]node03 crmd: debug: do_fsa_action: actions:trace:
>>> // A_LOG
>>> 06:59:20 [14937]node03 crmd: debug: do_log: FSA: Input
>>> I_STARTUP from crmd_init() received in state S_STARTING
>>> 06:59:20 [14937]node03 crmd: debug: do_fsa_action: actions:trace:
>>> // A_STARTUP
>>> 06:59:20 [14937]node03 crmd: debug: do_startup: Registering
>>> Signal Handlers
>>> 06:59:20 [14937]node03 crmd: debug: do_startup: Creating CIB
>>> and LRM objects
>>> 06:59:20 [14937]node03 crmd: debug: do_fsa_action: actions:trace:
>>> // A_CIB_START
>>> 06:59:20 [14937]node03 crmd: debug:
>>> init_client_ipc_comms_nodispatch: Attempting to talk on:
>>> /var/run/crm/cib_rw
>>> 06:59:20 [14937]node03 crmd: debug:
>>> init_client_ipc_comms_nodispatch: Could not init comms on:
>>> /var/run/crm/cib_rw
>>> 06:59:20 [14937]node03 crmd: debug: cib_native_signon_raw:
>>> Connection to command channel failed
>>> 06:59:20 [14937]node03 crmd: debug:
>>> init_client_ipc_comms_nodispatch: Attempting to talk on:
>>> /var/run/crm/cib_callback
>>> 06:59:20 [14937]node03 crmd: debug:
>>> init_client_ipc_comms_nodispatch: Could not init comms on:
>>> /var/run/crm/cib_callback
>>> 06:59:20 [14937]node03 crmd: debug: cib_native_signon_raw:
>>> Connection to callback channel failed
>>> 06:59:20 [14937]node03 crmd: debug: cib_native_signon_raw:
>>> Connection to CIB failed: connection failed
>>> 06:59:20 [14937]node03 crmd: debug: cib_native_signoff: Signing
>>> out of the CIB Service
>>> 06:59:20 [14935]node03 attrd: debug: init_ais_connection_classic:
>>> Adding fd=6 to mainloop
>>> 06:59:20 [14935]node03 attrd: info: init_ais_connection_classic:
>>> AIS connection established
>>> 06:59:20 [14935]node03 attrd: info: get_ais_nodeid: Server
>>> details: id=1003428268 uname=node03 cname=pcmk
>>> 06:59:20 [14935]node03 attrd: info: init_ais_connection_once:
>>> Connection to 'classic openais (with plugin)': established
>>> 06:59:20 [14935]node03 attrd: debug: crm_new_peer: Creating entry
>>> for node node03/1003428268
>>> 06:59:20 [14935]node03 attrd: info: crm_new_peer: Node node03 now
>>> has id: 1003428268
>>> 06:59:20 [14935]node03 attrd: info: crm_new_peer: Node 1003428268
>>> is now known as node03
>>> 06:59:20 [14935]node03 attrd: info: main: Cluster connection
>>> active
>>> 06:59:20 [14935]node03 attrd: info: main: Accepting attribute
>>> updates
>>> 06:59:20 [14935]node03 attrd: notice: main: Starting mainloop...
>>> 06:59:20 [14933]node03 stonith-ng: info: crm_log_init_worker: Changed
>>> active directory to /var/lib/heartbeat/cores/root
>>> 06:59:20 [14933]node03 stonith-ng: info: get_cluster_type: Cluster
>>> type is: 'openais'
>>> 06:59:20 [14933]node03 stonith-ng: notice: crm_cluster_connect:
>>> Connecting to cluster infrastructure: classic openais (with plugin)
>>> 06:59:20 [14933]node03 stonith-ng: info: init_ais_connection_classic:
>>> Creating connection to our Corosync plugin
>>> 06:59:20 [14932]node03 cib: info: crm_log_init_worker: Changed
>>> active directory to /var/lib/heartbeat/cores/hacluster
>>> 06:59:20 [14932]node03 cib: info: retrieveCib: Reading cluster
>>> configuration from: /var/lib/heartbeat/crm/cib.xml (digest:
>>> /var/lib/heartbeat/crm/cib.xml.sig)
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <cib epoch="251" num_updates="0" admin_epoch="1"
>>> validate-with="pacemaker-1.2" crm_feature_set="3.0.6"
>>> update-origin="node03" update-client="crmd" cib-last-written="Tue Apr 9
>>> 06:48:33 2013" have-quorum="1" >
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <configuration >
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <crm_config >
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <cluster_property_set id="cib-bootstrap-options" >
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <nvpair
>>> id="cib-bootstrap-options-default-resource-stickiness"
>>> name="default-resource-stickiness" value="1000" />
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <nvpair id="cib-bootstrap-options-no-quorum-policy"
>>> name="no-quorum-policy" value="ignore" />
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <nvpair id="cib-bootstrap-options-stonith-enabled"
>>> name="stonith-enabled" value="false" />
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <nvpair id="cib-bootstrap-options-expected-quorum-votes"
>>> name="expected-quorum-votes" value="3" />
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <nvpair id="cib-bootstrap-options-dc-version"
>>> name="dc-version"
>>> value="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" />
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <nvpair id="cib-bootstrap-options-cluster-infrastructure"
>>> name="cluster-infrastructure" value="openais" />
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] <nvpair id="cib-bootstrap-options-last-lrm-refresh"
>>> name="last-lrm-refresh" value="1365160119" />
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] </cluster_property_set>
>>> 06:59:20 [14932]node03 cib: debug: readCibXmlFile:
>>> [on-disk] </crm_config>
>>> …
>>> …
>>> ...
>>>
>>>
>>> We are still seeing the extra pacemaker daemons when corosync starts up.
>>> As an added check, all pacemaker daemons exited correctly when stopping
>>> corosync.
>>> lrmd attempts to start twice:
>>>
>>> ps aux | grep lrmd
>>> root 16412 0.0 0.0 0 0 ? Z 07:20 0:00 [lrmd]
>>> <defunct>
>>> root 16419 0.0 0.0 34240 1052 ? S 07:20 0:00
>>> /usr/lib64/heartbeat/lrmd
>>> root 21030 0.0 0.0 103244 856 pts/0 S+ 08:37 0:00 grep lrmd
>>>
>>>
>>> Help to resolve this issue appreciated..
>>>
>>> Cheers,
>>> Jimmy.
>>>
>>>
>>> On 9 Apr 2013, at 00:16, Andrew Beekhof <[email protected]> wrote:
>>>
>>>>
>>>> On 08/04/2013, at 9:44 PM, Jimmy Magee <[email protected]> wrote:
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>> thanks for your reply, we are running at debug level with the following
>>>>> config from corosync.conf
>>>>>
>>>>> logging {
>>>>> fileline: off
>>>>> to_syslog: yes
>>>>> to_stderr: no
>>>>> syslog_facility: daemon
>>>>> debug: on
>>>>> timestamp: on
>>>>> }
>>>>>
>>>>> Looking at the issue further, there seems to be 2 instances of some
>>>>> pacemaker daemons running on this particular node….
>>>>>
>>>>>
>>>>> ps aux | grep pace
>>>>>
>>>>> 495 3050 0.2 0.0 89956 7184 ? S 07:10 0:01
>>>>> /usr/libexec/pacemaker/cib
>>>>> root 3051 0.0 0.0 87128 3152 ? S 07:10 0:00
>>>>> /usr/libexec/pacemaker/stonithd
>>>>> 495 3053 0.0 0.0 91188 2840 ? S 07:10 0:00
>>>>> /usr/libexec/pacemaker/attrd
>>>>> 495 3054 0.0 0.0 87336 2484 ? S 07:10 0:00
>>>>> /usr/libexec/pacemaker/pengine
>>>>> 495 3055 0.0 0.0 91332 3156 ? S 07:10 0:00
>>>>> /usr/libexec/pacemaker/crmd
>>>>> 495 3057 0.0 0.0 88876 5224 ? S 07:10 0:00
>>>>> /usr/libexec/pacemaker/cib
>>>>> root 3058 0.0 0.0 87128 3132 ? S 07:10 0:00
>>>>> /usr/libexec/pacemaker/stonithd
>>>>> 495 3060 0.0 0.0 91188 2788 ? S 07:10 0:00
>>>>> /usr/libexec/pacemaker/attrd
>>>>> 495 3062 0.0 0.0 91436 3932 ? S 07:10 0:00
>>>>> /usr/libexec/pacemaker/crmd
>>>>>
>>>>>
>>>>> ps aux | grep corosync
>>>>> root 3044 0.1 0.0 977852 9264 ? Ssl 07:10 0:01 corosync
>>>>> root 9363 0.0 0.0 103248 856 pts/0 S+ 07:33 0:00 grep
>>>>> corosync
>>>>>
>>>>>
>>>>> ps aux | grep lrmd
>>>>> root 3052 0.0 0.0 76464 2528 ? S 07:10 0:00
>>>>> /usr/lib64/heartbeat/lrmd
>>>>>
>>>>>
>>>>> Not sure why this is the case? Appreciate any help..
>>>>>
>>>>
>>>> Have you perhaps specified "ver: 0" for the pacemaker plugin and run
>>>> "service pacemaker start" ?
>>>>
>>>>> Cheers,
>>>>> Jimmy.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 8 Apr 2013, at 03:00, Andrew Beekhof <[email protected]> wrote:
>>>>>
>>>>>> This doesn't look promising:
>>>>>>
>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for
>>>>>> signal 15
>>>>>> lrmd: [4946]: info: Signal sent to pid=4939, waiting for process to exit
>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for
>>>>>> signal 17
>>>>>> lrmd: [4939]: info: enabling coredumps
>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for
>>>>>> signal 10
>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for
>>>>>> signal 12
>>>>>> lrmd: [4939]: info: Started.
>>>>>> lrmd: [4939]: info: lrmd is shutting down
>>>>>>
>>>>>> The lrmd comes up but then immediately shuts down.
>>>>>> Perhaps try enabling debug to see if that sheds any light.
>>>>>>
>>>>>> On 06/04/2013, at 4:58 AM, Jimmy Magee <[email protected]> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> Apologies for reposting this query, it inadvertently got added to an
>>>>>>> existing topic!
>>>>>>>
>>>>>>>
>>>>>>> We have a three node cluster deployed in a customer's network:
>>>>>>> - 2 nodes are on the same switch
>>>>>>> - 3rd node on the same subnet but there's a router in between.
>>>>>>> - IP Multicast is enabled and has been tested using omping as follows..
>>>>>>>
>>>>>>> On each node ran..
>>>>>>>
>>>>>>> omping node01 node02 node03
>>>>>>>
>>>>>>>
>>>>>>> ON node 3
>>>>>>>
>>>>>>> Node01 : unicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev =
>>>>>>> 0.128/0.181/0.255/0.025
>>>>>>> Node01 : multicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev =
>>>>>>> 0.140/0.187/0.219/0.021
>>>>>>> Node02 : unicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev =
>>>>>>> 0.115/0.150/0.168/0.021
>>>>>>> Node02 : multicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev =
>>>>>>> 0.134/0.162/0.177/0.014
>>>>>>>
>>>>>>>
>>>>>>> On node 2
>>>>>>>
>>>>>>>
>>>>>>> Node01 : unicast, xmt/rcv/%loss = 9/9/0%, min/avg/max/std-dev =
>>>>>>> 0.168/0.191/0.205/0.014
>>>>>>> Node01 : multicast, xmt/rcv/%loss = 9/8/11% (seq>=2 0%),
>>>>>>> min/avg/max/std-dev = 0.138/0.179/0.206/0.028
>>>>>>> Node03 : unicast, xmt/rcv/%loss = 9/9/0%, min/avg/max/std-dev =
>>>>>>> 0.112/0.149/0.175/0.022
>>>>>>> Node03 : multicast, xmt/rcv/%loss = 9/8/11% (seq>=2 0%),
>>>>>>> min/avg/max/std-dev = 0.124/0.167/0.178/0.018
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On node 1
>>>>>>>
>>>>>>> Node02 : unicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev =
>>>>>>> 0.154/0.185/0.208/0.019
>>>>>>> Node02 : multicast, xmt/rcv/%loss = 8/8/0%, min/avg/max/std-dev =
>>>>>>> 0.175/0.198/0.214/0.015
>>>>>>> Node03 : unicast, xmt/rcv/%loss = 23/23/0%, min/avg/max/std-dev =
>>>>>>> 0.114/0.160/0.185/0.019
>>>>>>> Node03 : multicast, xmt/rcv/%loss = 23/22/4% (seq>=2 0%),
>>>>>>> min/avg/max/std-dev = 0.124/0.172/0.197/0.019
>>>>>>>
>>>>>>>
>>>>>>> - Problem is intermittent but frequent. Occasionally starts fine when
>>>>>>> started from scratch.
>>>>>>>
>>>>>>> We suspect the problem is related to node 3 as we can see lrmd failures
>>>>>>> as per the attached log. We've checked permissions are ok as per
>>>>>>> https://bugs.launchpad.net/ubuntu/+source/cluster-glue/+bug/676391
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> stonith-ng[1437]: error: ais_dispatch: AIS connection failed
>>>>>>> stonith-ng[1437]: error: stonith_peer_ais_destroy: AIS connection
>>>>>>> terminated
>>>>>>> corosync[1430]: [SERV ] Service engine unloaded: Pacemaker Cluster
>>>>>>> Manager 1.1.6
>>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync extended
>>>>>>> virtual synchrony service
>>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync
>>>>>>> configuration service
>>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync cluster
>>>>>>> closed process group service v1.01
>>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync cluster
>>>>>>> config database access v1.01
>>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync profile
>>>>>>> loading service
>>>>>>> corosync[1430]: [SERV ] Service engine unloaded: corosync cluster
>>>>>>> quorum service v0.1
>>>>>>> corosync[1430]: [MAIN ] Corosync Cluster Engine exiting with status
>>>>>>> 0 at main.c:1894.
>>>>>>>
>>>>>>> corosync[4931]: [MAIN ] Corosync built-in features: nss dbus rdma
>>>>>>> snmp
>>>>>>> corosync[4931]: [MAIN ] Successfully read main configuration file
>>>>>>> '/etc/corosync/corosync.conf'.
>>>>>>> corosync[4931]: [TOTEM ] Initializing transport (UDP/IP Multicast).
>>>>>>> corosync[4931]: [TOTEM ] Initializing transmit/receive security:
>>>>>>> libtomcrypt SOBER128/SHA1HMAC (mode 0).
>>>>>>> corosync[4931]: [TOTEM ] The network interface [10.87.79.59] is now
>>>>>>> up.
>>>>>>> corosync[4931]: [pcmk ] Logging: Initialized pcmk_startup
>>>>>>> corosync[4931]: [SERV ] Service engine loaded: Pacemaker Cluster
>>>>>>> Manager 1.1.6
>>>>>>> corosync[4931]: [pcmk ] Logging: Initialized pcmk_startup
>>>>>>> corosync[4931]: [SERV ] Service engine loaded: Pacemaker Cluster
>>>>>>> Manager 1.1.6
>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync extended
>>>>>>> virtual synchrony service
>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync
>>>>>>> configuration service
>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync cluster
>>>>>>> closed process group service v1.01
>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync cluster
>>>>>>> config database access v1.01
>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync profile
>>>>>>> loading service
>>>>>>> corosync[4931]: [SERV ] Service engine loaded: corosync cluster
>>>>>>> quorum service v0.1
>>>>>>> corosync[4931]: [MAIN ] Compatibility mode set to whitetank. Using
>>>>>>> V1 and V2 of the synchronization engine.
>>>>>>> corosync[4931]: [TOTEM ] A processor joined or left the membership
>>>>>>> and a new membership was formed.
>>>>>>> corosync[4931]: [CPG ] chosen downlist: sender r(0) ip(10.87.79.59)
>>>>>>> ; members(old:0 left:0)
>>>>>>> corosync[4931]: [MAIN ] Completed service synchronization, ready to
>>>>>>> provide service.
>>>>>>> cib[4937]: info: crm_log_init_worker: Changed active directory to
>>>>>>> /var/lib/heartbeat/cores/hacluster
>>>>>>> cib[4937]: info: retrieveCib: Reading cluster configuration from:
>>>>>>> /var/lib/heartbeat/crm/cib.xml (digest:
>>>>>>> /var/lib/heartbeat/crm/cib.xml.sig)
>>>>>>> cib[4937]: info: validate_with_relaxng: Creating RNG parser context
>>>>>>> stonith-ng[4945]: info: crm_log_init_worker: Changed active
>>>>>>> directory to /var/lib/heartbeat/cores/root
>>>>>>> stonith-ng[4945]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>>> stonith-ng[4945]: notice: crm_cluster_connect: Connecting to cluster
>>>>>>> infrastructure: classic openais (with plugin)
>>>>>>> stonith-ng[4945]: info: init_ais_connection_classic: Creating
>>>>>>> connection to our Corosync plugin
>>>>>>> cib[4944]: info: crm_log_init_worker: Changed active directory to
>>>>>>> /var/lib/heartbeat/cores/hacluster
>>>>>>> cib[4944]: info: retrieveCib: Reading cluster configuration from:
>>>>>>> /var/lib/heartbeat/crm/cib.xml (digest:
>>>>>>> /var/lib/heartbeat/crm/cib.xml.sig)
>>>>>>> stonith-ng[4945]: info: init_ais_connection_classic: AIS connection
>>>>>>> established
>>>>>>> stonith-ng[4945]: info: get_ais_nodeid: Server details:
>>>>>>> id=1003428268 uname=node03 cname=pcmk
>>>>>>> stonith-ng[4945]: info: init_ais_connection_once: Connection to
>>>>>>> 'classic openais (with plugin)': established
>>>>>>> stonith-ng[4945]: info: crm_new_peer: Node node03 now has id:
>>>>>>> 1003428268
>>>>>>> stonith-ng[4945]: info: crm_new_peer: Node 1003428268 is now known
>>>>>>> as node03
>>>>>>> cib[4944]: info: validate_with_relaxng: Creating RNG parser context
>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for
>>>>>>> signal 15
>>>>>>> lrmd: [4946]: info: Signal sent to pid=4939, waiting for process to exit
>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for
>>>>>>> signal 17
>>>>>>> lrmd: [4939]: info: enabling coredumps
>>>>>>> stonith-ng[4938]: info: crm_log_init_worker: Changed active
>>>>>>> directory to /var/lib/heartbeat/cores/root
>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for
>>>>>>> signal 10
>>>>>>> lrmd: [4939]: info: G_main_add_SignalHandler: Added signal handler for
>>>>>>> signal 12
>>>>>>> lrmd: [4939]: info: Started.
>>>>>>> stonith-ng[4938]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>>> lrmd: [4939]: info: lrmd is shutting down
>>>>>>> stonith-ng[4938]: notice: crm_cluster_connect: Connecting to cluster
>>>>>>> infrastructure: classic openais (with plugin)
>>>>>>> stonith-ng[4938]: info: init_ais_connection_classic: Creating
>>>>>>> connection to our Corosync plugin
>>>>>>> attrd[4940]: info: crm_log_init_worker: Changed active directory to
>>>>>>> /var/lib/heartbeat/cores/hacluster
>>>>>>> pengine[4941]: info: crm_log_init_worker: Changed active directory
>>>>>>> to /var/lib/heartbeat/cores/hacluster
>>>>>>> attrd[4940]: info: main: Starting up
>>>>>>> attrd[4940]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>>> attrd[4940]: notice: crm_cluster_connect: Connecting to cluster
>>>>>>> infrastructure: classic openais (with plugin)
>>>>>>> attrd[4940]: info: init_ais_connection_classic: Creating connection
>>>>>>> to our Corosync plugin
>>>>>>> crmd[4942]: info: crm_log_init_worker: Changed active directory to
>>>>>>> /var/lib/heartbeat/cores/hacluster
>>>>>>> pengine[4941]: info: main: Starting pengine
>>>>>>> crmd[4942]: notice: main: CRM Hg Version:
>>>>>>> 148fccfd5985c5590cc601123c6c16e966b85d14
>>>>>>> pengine[4948]: info: crm_log_init_worker: Changed active directory
>>>>>>> to /var/lib/heartbeat/cores/hacluster
>>>>>>> pengine[4948]: warning: main: Terminating previous PE instance
>>>>>>> attrd[4947]: info: crm_log_init_worker: Changed active directory to
>>>>>>> /var/lib/heartbeat/cores/hacluster
>>>>>>> pengine[4941]: warning: process_pe_message: Received quit message,
>>>>>>> terminating
>>>>>>> attrd[4947]: info: main: Starting up
>>>>>>> attrd[4947]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>>> attrd[4947]: notice: crm_cluster_connect: Connecting to cluster
>>>>>>> infrastructure: classic openais (with plugin)
>>>>>>> attrd[4947]: info: init_ais_connection_classic: Creating connection
>>>>>>> to our Corosync plugin
>>>>>>> crmd[4949]: info: crm_log_init_worker: Changed active directory to
>>>>>>> /var/lib/heartbeat/cores/hacluster
>>>>>>> crmd[4949]: notice: main: CRM Hg Version:
>>>>>>> 148fccfd5985c5590cc601123c6c16e966b85d14
>>>>>>> stonith-ng[4938]: info: init_ais_connection_classic: AIS connection
>>>>>>> established
>>>>>>> stonith-ng[4938]: info: get_ais_nodeid: Server details:
>>>>>>> id=1003428268 uname=node03 cname=pcmk
>>>>>>> stonith-ng[4938]: info: init_ais_connection_once: Connection to
>>>>>>> 'classic openais (with plugin)': established
>>>>>>> stonith-ng[4938]: info: crm_new_peer: Node node03 now has id:
>>>>>>> 1003428268
>>>>>>> stonith-ng[4938]: info: crm_new_peer: Node 1003428268 is now known
>>>>>>> as node03
>>>>>>> attrd[4940]: info: init_ais_connection_classic: AIS connection
>>>>>>> established
>>>>>>> attrd[4940]: info: get_ais_nodeid: Server details: id=1003428268
>>>>>>> uname=node03 cname=pcmk
>>>>>>> attrd[4940]: info: init_ais_connection_once: Connection to 'classic
>>>>>>> openais (with plugin)': established
>>>>>>> attrd[4940]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>>> attrd[4940]: info: crm_new_peer: Node 1003428268 is now known as
>>>>>>> node03
>>>>>>> attrd[4940]: info: main: Cluster connection active
>>>>>>> attrd[4940]: info: main: Accepting attribute updates
>>>>>>> attrd[4940]: notice: main: Starting mainloop...
>>>>>>> attrd[4947]: info: init_ais_connection_classic: AIS connection
>>>>>>> established
>>>>>>> attrd[4947]: info: get_ais_nodeid: Server details: id=1003428268
>>>>>>> uname=node03 cname=pcmk
>>>>>>> attrd[4947]: info: init_ais_connection_once: Connection to 'classic
>>>>>>> openais (with plugin)': established
>>>>>>> attrd[4947]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>>> attrd[4947]: info: crm_new_peer: Node 1003428268 is now known as
>>>>>>> node03
>>>>>>> attrd[4947]: info: main: Cluster connection active
>>>>>>> attrd[4947]: info: main: Accepting attribute updates
>>>>>>> attrd[4947]: notice: main: Starting mainloop...
>>>>>>> cib[4937]: info: startCib: CIB Initialization completed successfully
>>>>>>> cib[4937]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>>> cib[4937]: notice: crm_cluster_connect: Connecting to cluster
>>>>>>> infrastructure: classic openais (with plugin)
>>>>>>> cib[4937]: info: init_ais_connection_classic: Creating connection
>>>>>>> to our Corosync plugin
>>>>>>> cib[4944]: info: startCib: CIB Initialization completed successfully
>>>>>>> cib[4944]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>>> cib[4944]: notice: crm_cluster_connect: Connecting to cluster
>>>>>>> infrastructure: classic openais (with plugin)
>>>>>>> cib[4944]: info: init_ais_connection_classic: Creating connection
>>>>>>> to our Corosync plugin
>>>>>>> cib[4937]: info: init_ais_connection_classic: AIS connection
>>>>>>> established
>>>>>>> cib[4937]: info: get_ais_nodeid: Server details: id=1003428268
>>>>>>> uname=node03 cname=pcmk
>>>>>>> cib[4937]: info: init_ais_connection_once: Connection to 'classic
>>>>>>> openais (with plugin)': established
>>>>>>> cib[4937]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>>> cib[4937]: info: crm_new_peer: Node 1003428268 is now known as
>>>>>>> node03
>>>>>>> cib[4937]: info: cib_init: Starting cib mainloop
>>>>>>> cib[4937]: info: ais_dispatch_message: Membership 6892: quorum
>>>>>>> still lost
>>>>>>> cib[4937]: info: crm_update_peer: Node node03: id=1003428268
>>>>>>> state=member (new) addr=r(0) ip(10.87.79.59) (new) votes=1 (new)
>>>>>>> born=0 seen=6892 proc=00000000000000000000000000111312 (new)
>>>>>>> cib[4944]: info: init_ais_connection_classic: AIS connection
>>>>>>> established
>>>>>>> cib[4944]: info: get_ais_nodeid: Server details: id=1003428268
>>>>>>> uname=node03 cname=pcmk
>>>>>>> cib[4944]: info: init_ais_connection_once: Connection to 'classic
>>>>>>> openais (with plugin)': established
>>>>>>> cib[4944]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>>> cib[4944]: info: crm_new_peer: Node 1003428268 is now known as
>>>>>>> node03
>>>>>>> cib[4944]: info: cib_init: Starting cib mainloop
>>>>>>> stonith-ng[4945]: notice: setup_cib: Watching for stonith topology
>>>>>>> changes
>>>>>>> stonith-ng[4945]: info: main: Starting stonith-ng mainloop
>>>>>>> cib[4937]: info: ais_dispatch_message: Membership 6896: quorum
>>>>>>> still lost
>>>>>>> corosync[4931]: [TOTEM ] A processor joined or left the membership
>>>>>>> and a new membership was formed.
>>>>>>> cib[4937]: info: crm_new_peer: Node <null> now has id: 969873836
>>>>>>> cib[4937]: info: crm_update_peer: Node (null): id=969873836
>>>>>>> state=member (new) addr=r(0) ip(172.25.207.57) votes=0 born=0
>>>>>>> seen=6896 proc=00000000000000000000000000000000
>>>>>>> cib[4937]: info: crm_new_peer: Node <null> now has id: 986651052
>>>>>>> cib[4937]: info: crm_update_peer: Node (null): id=986651052
>>>>>>> state=member (new) addr=r(0) ip(172.25.207.58) votes=0 born=0
>>>>>>> seen=6896 proc=00000000000000000000000000000000
>>>>>>> cib[4937]: notice: ais_dispatch_message: Membership 6896: quorum
>>>>>>> acquired
>>>>>>> cib[4937]: info: crm_get_peer: Node 986651052 is now known as node02
>>>>>>> cib[4937]: info: crm_update_peer: Node node02: id=986651052
>>>>>>> state=member addr=r(0) ip(172.25.207.58) votes=1 (new) born=6812
>>>>>>> seen=6896 proc=00000000000000000000000000111312 (new)
>>>>>>> cib[4937]: info: ais_dispatch_message: Membership 6896: quorum
>>>>>>> retained
>>>>>>> cib[4937]: info: crm_get_peer: Node 969873836 is now known as node01
>>>>>>> cib[4937]: info: crm_update_peer: Node node01: id=969873836
>>>>>>> state=member addr=r(0) ip(172.25.207.57) votes=1 (new) born=6848
>>>>>>> seen=6896 proc=00000000000000000000000000111312 (new)
>>>>>>> rsyslogd-2177: imuxsock begins to drop messages from pid 4931 due to
>>>>>>> rate-limiting
>>>>>>> crmd[4942]: info: do_cib_control: CIB connection established
>>>>>>> crmd[4942]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>>> crmd[4942]: notice: crm_cluster_connect: Connecting to cluster
>>>>>>> infrastructure: classic openais (with plugin)
>>>>>>> crmd[4942]: info: init_ais_connection_classic: Creating connection
>>>>>>> to our Corosync plugin
>>>>>>> cib[4937]: info: cib_process_diff: Diff 1.249.28 -> 1.249.29 not
>>>>>>> applied to 1.249.0: current "num_updates" is less than required
>>>>>>> cib[4937]: info: cib_server_process_diff: Requesting re-sync from
>>>>>>> peer
>>>>>>> crmd[4949]: info: do_cib_control: CIB connection established
>>>>>>> crmd[4949]: info: get_cluster_type: Cluster type is: 'openais'
>>>>>>> crmd[4949]: notice: crm_cluster_connect: Connecting to cluster
>>>>>>> infrastructure: classic openais (with plugin)
>>>>>>> crmd[4949]: info: init_ais_connection_classic: Creating connection
>>>>>>> to our Corosync plugin
>>>>>>> stonith-ng[4938]: notice: setup_cib: Watching for stonith topology
>>>>>>> changes
>>>>>>> stonith-ng[4938]: info: main: Starting stonith-ng mainloop
>>>>>>> cib[4937]: notice: cib_server_process_diff: Not applying diff
>>>>>>> 1.249.29 -> 1.249.30 (sync in progress)
>>>>>>> crmd[4942]: info: init_ais_connection_classic: AIS connection
>>>>>>> established
>>>>>>> crmd[4942]: info: get_ais_nodeid: Server details: id=1003428268
>>>>>>> uname=node03 cname=pcmk
>>>>>>> crmd[4942]: info: init_ais_connection_once: Connection to 'classic
>>>>>>> openais (with plugin)': established
>>>>>>> crmd[4942]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>>> crmd[4942]: info: crm_new_peer: Node 1003428268 is now known as
>>>>>>> node03
>>>>>>> crmd[4942]: info: ais_status_callback: status: node03 is now unknown
>>>>>>> crmd[4942]: info: do_ha_control: Connected to the cluster
>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 1
>>>>>>> (30 max) times
>>>>>>> crmd[4949]: info: init_ais_connection_classic: AIS connection
>>>>>>> established
>>>>>>> crmd[4949]: info: get_ais_nodeid: Server details: id=1003428268
>>>>>>> uname=node03 cname=pcmk
>>>>>>> crmd[4949]: info: init_ais_connection_once: Connection to 'classic
>>>>>>> openais (with plugin)': established
>>>>>>> crmd[4942]: notice: ais_dispatch_message: Membership 6896: quorum
>>>>>>> acquired
>>>>>>> crmd[4949]: info: crm_new_peer: Node node03 now has id: 1003428268
>>>>>>> crmd[4949]: info: crm_new_peer: Node 1003428268 is now known as
>>>>>>> node03
>>>>>>> crmd[4942]: info: crm_new_peer: Node node01 now has id: 969873836
>>>>>>> crmd[4949]: info: ais_status_callback: status: node03 is now unknown
>>>>>>> crmd[4942]: info: crm_new_peer: Node 969873836 is now known as
>>>>>>> node01
>>>>>>> crmd[4949]: info: do_ha_control: Connected to the cluster
>>>>>>> crmd[4942]: info: ais_status_callback: status: node01 is now unknown
>>>>>>> crmd[4942]: info: ais_status_callback: status: node01 is now member
>>>>>>> (was unknown)
>>>>>>> crmd[4942]: info: crm_update_peer: Node node01: id=969873836
>>>>>>> state=member (new) addr=r(0) ip(172.25.207.57) votes=1 born=6848
>>>>>>> seen=6896 proc=00000000000000000000000000111312
>>>>>>> crmd[4942]: info: crm_new_peer: Node node02 now has id: 986651052
>>>>>>> crmd[4942]: info: crm_new_peer: Node 986651052 is now known as
>>>>>>> node02
>>>>>>> crmd[4942]: info: ais_status_callback: status: node02 is now unknown
>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 1
>>>>>>> (30 max) times
>>>>>>> crmd[4942]: info: ais_status_callback: status: node02 is now member
>>>>>>> (was unknown)
>>>>>>> crmd[4942]: info: crm_update_peer: Node node02: id=986651052
>>>>>>> state=member (new) addr=r(0) ip(172.25.207.58) votes=1 born=6812
>>>>>>> seen=6896 proc=00000000000000000000000000111312
>>>>>>> crmd[4942]: notice: crmd_peer_update: Status update: Client
>>>>>>> node03/crmd now has status [online] (DC=<null>)
>>>>>>> crmd[4942]: info: ais_status_callback: status: node03 is now member
>>>>>>> (was unknown)
>>>>>>> crmd[4942]: info: crm_update_peer: Node node03: id=1003428268
>>>>>>> state=member (new) addr=r(0) ip(10.87.79.59) (new) votes=1 (new)
>>>>>>> born=6896 seen=6896 proc=00000000000000000000000000111312 (new)
>>>>>>> crmd[4942]: info: ais_dispatch_message: Membership 6896: quorum
>>>>>>> retained
>>>>>>> cib[4937]: notice: cib_server_process_diff: Not applying diff
>>>>>>> 1.249.30 -> 1.249.31 (sync in progress)
>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 2
>>>>>>> (30 max) times
>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 3
>>>>>>> (30 max) times
>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 2
>>>>>>> (30 max) times
>>>>>>> crmd[4949]: notice: ais_dispatch_message: Membership 6896: quorum
>>>>>>> acquired
>>>>>>> rsyslogd-2177: imuxsock begins to drop messages from pid 4937 due to
>>>>>>> rate-limiting
>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 4
>>>>>>> (30 max) times
>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 5
>>>>>>> (30 max) times
>>>>>>> pengine[4948]: info: main: Starting pengine
>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped
>>>>>>> (2000ms)
>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 6
>>>>>>> (30 max) times
>>>>>>> crmd[4949]: info: crm_timer_popped: Wait Timer (I_NULL) just popped
>>>>>>> (2000ms)
>>>>>>> crmd[4949]: warning: do_lrm_control: Failed to sign on to the LRM 3
>>>>>>> (30 max) times
>>>>>>> attrd[4940]: info: cib_connect: Connected to the CIB after 1 signon
>>>>>>> attempts
>>>>>>> attrd[4940]: info: cib_connect: Sending full refresh
>>>>>>> crmd[4942]: info: crm_timer_popped: Wait Timer (I_NULL) just popped
>>>>>>> (2000ms)
>>>>>>> crmd[4942]: warning: do_lrm_control: Failed to sign on to the LRM 7
>>>>>>> (30 max) times
>>>>>>> attrd[4947]: info: cib_connect: Connected to the CIB after 1 signon
>>>>>>> attempts
>>>>>>> attrd[4947]: info: cib_connect: Sending full refresh
>>>>>>> [...the same "crm_timer_popped: Wait Timer (I_NULL) just popped
>>>>>>> (2000ms)" and "do_lrm_control: Failed to sign on to the LRM N
>>>>>>> (30 max) times" messages repeat every 2 seconds for both crmd
>>>>>>> instances; crmd[4942] reaches attempt 18 and crmd[4949] attempt 15
>>>>>>> before the excerpt ends...]
>>>>>>>
>>>>>>>
>>>>>>> We have the following components installed:
>>>>>>>
>>>>>>>
>>>>>>> corosynclib-1.4.1-15.el6.x86_64
>>>>>>> corosync-1.4.1-15.el6.x86_64
>>>>>>> cluster-glue-libs-1.0.5-6.el6.x86_64
>>>>>>> clusterlib-3.0.12.1-49.el6.x86_64
>>>>>>> pacemaker-cluster-libs-1.1.7-6.el6.x86_64
>>>>>>> cluster-glue-1.0.5-6.el6.x86_64
>>>>>>> resource-agents-3.9.2-12.el6.x86_64
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> We'd appreciate assistance with debugging this issue, and any
>>>>>>> pointers to possible causes.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Jimmy
>>>>>>> _______________________________________________
>>>>>>> Linux-HA mailing list
>>>>>>> [email protected]
>>>>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>>>>>> See also: http://linux-ha.org/ReportingProblems
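Incidentally, a component inventory like the one quoted above is easiest to capture for a report with `rpm -qa` piped through a filter. The sketch below runs that filter over a sample dump (taken from the quoted list, plus one unrelated package to show the filtering); on a real node you would feed it `rpm -qa` instead of the here-doc:

```shell
# On a cluster node:  rpm -qa | grep -E 'corosync|pacemaker|cluster|resource-agents' | sort
# Here the same filter runs over a sample package dump so the pipeline
# itself can be checked without rpm installed.
cat <<'EOF' | grep -E 'corosync|pacemaker|cluster|resource-agents' | sort
corosynclib-1.4.1-15.el6.x86_64
corosync-1.4.1-15.el6.x86_64
cluster-glue-libs-1.0.5-6.el6.x86_64
clusterlib-3.0.12.1-49.el6.x86_64
pacemaker-cluster-libs-1.1.7-6.el6.x86_64
cluster-glue-1.0.5-6.el6.x86_64
resource-agents-3.9.2-12.el6.x86_64
glibc-2.12-1.107.el6.x86_64
EOF
```

The unrelated `glibc` line is dropped by the filter, leaving only the HA stack, sorted, ready to paste into a problem report.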