I've just realised I have a classic split-brain with my CRM setup. I'm running
pacemaker 1.1.6-2ubuntu0~ppa2 (installed from ubuntu-ha-maintainers-ppa-lucid)
and heartbeat 1:3.0.5-3ubuntu0~ppa1 on Ubuntu Lucid. I have 3 IPaddr2, 3
SendArp, and 3 MailTo resources set up on two servers (front ends running
haproxy). This was all working fine, but when I checked crm_mon today I found
that each node shows the other as offline, and both are publishing the same
floating IPs simultaneously! Weirdly, everything still seems to be working!
I can't see any reason for this: it was working fine previously and the config
has not changed. Both servers are up and running, and the firewall ports are
open (each node allows UDP on port 694 from the other machine). crm_mon shows
this on each node:
============
Last updated: Tue Jun 26 16:28:23 2012
Last change: Tue Mar 27 22:19:17 2012
Stack: Heartbeat
Current DC: proxy1.example.com (68890308-615b-4b28-bb8b-5aa00bdbf65c) -
partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 1 expected votes
10 Resources configured.
============
Online: [ proxy1.example.com ]
OFFLINE: [ proxy2.example.com ]
Resource Group: proxyfloat
ip1 (ocf::heartbeat:IPaddr2): Started proxy1.example.com
ip1arp (ocf::heartbeat:SendArp): Started proxy1.example.com
ip1email (ocf::heartbeat:MailTo): Started proxy1.example.com
Resource Group: proxyfloat2
ip2 (ocf::heartbeat:IPaddr2): Started proxy1.example.com
ip2arp (ocf::heartbeat:SendArp): Started proxy1.example.com
ip2email (ocf::heartbeat:MailTo): Started proxy1.example.com
Resource Group: proxyfloat3
ip3 (ocf::heartbeat:IPaddr2): Started proxy1.example.com
ip3arp (ocf::heartbeat:SendArp): Started proxy1.example.com
ip3email (ocf::heartbeat:MailTo): Started proxy1.example.com
============
Last updated: Tue Jun 26 16:28:09 2012
Last change: Tue Mar 27 22:19:17 2012
Stack: Heartbeat
Current DC: proxy2.example.com (30a5636b-26f6-4c31-9ea7-d4fb912ee624) -
partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 1 expected votes
10 Resources configured.
============
Online: [ proxy2.example.com ]
OFFLINE: [ proxy1.example.com ]
Resource Group: proxyfloat
ip1 (ocf::heartbeat:IPaddr2): Started proxy2.example.com
ip1arp (ocf::heartbeat:SendArp): Started proxy2.example.com
ip1email (ocf::heartbeat:MailTo): Started proxy2.example.com
Resource Group: proxyfloat2
ip2 (ocf::heartbeat:IPaddr2): Started proxy2.example.com
ip2arp (ocf::heartbeat:SendArp): Started proxy2.example.com
ip2email (ocf::heartbeat:MailTo): Started proxy2.example.com
Resource Group: proxyfloat3
ip3 (ocf::heartbeat:IPaddr2): Started proxy2.example.com
ip3arp (ocf::heartbeat:SendArp): Started proxy2.example.com
ip3email (ocf::heartbeat:MailTo): Started proxy2.example.com
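
As a sanity check on the firewall claim above, I can confirm that UDP
datagrams actually get through between the nodes (not just that the iptables
rules look right) with a throwaway sender/listener pair. A minimal sketch,
assuming nothing about my real config: run listen() on one node and send()
from the other. HOST, the high port 6940, and the b"ha-probe" payload are
placeholders for the demo; port 694 itself needs root.

```python
# Minimal UDP reachability check (a sketch, not part of the cluster config):
# run listen() on the receiving node and send() from the other node to
# confirm datagrams on the heartbeat port get through. Port 694 needs root,
# so this demo binds a high port; HOST/PORT are placeholders, not real values.
import socket
import threading
import time

HOST = "127.0.0.1"   # replace with the listening node's address
PORT = 6940          # use 694 for the real heartbeat check (requires root)

def listen(results):
    """Wait up to 5s for a single datagram and record its payload."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((HOST, PORT))
    sock.settimeout(5)
    try:
        data, _addr = sock.recvfrom(1024)
        results.append(data)
    finally:
        sock.close()

def send():
    """Fire one probe datagram at the listener."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(b"ha-probe", (HOST, PORT))
    sock.close()

if __name__ == "__main__":
    results = []
    t = threading.Thread(target=listen, args=(results,))
    t.start()
    time.sleep(0.2)          # give the listener time to bind
    send()
    t.join()
    print("received:", results)  # expect [b'ha-probe'] if the path is open
```

If the probe arrives in both directions but heartbeat still shows the peer
offline, that would at least rule out the network path and point at the
heartbeat layer itself.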
Both servers are logging this sequence every 15 minutes (the 900000ms PEngine
recheck timer):
Jun 26 06:44:18 proxy1 crmd: [3205]: info: crm_timer_popped: PEngine Recheck
Timer (I_PE_CALC) just popped (900000ms)
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED
origin=crm_timer_popped ]
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: Progressed to
state S_POLICY_ENGINE after C_TIMER_POPPED
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: All 1 cluster
nodes are eligible to run resources.
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_pe_invoke: Query 1746: Requesting
the current CIB: S_POLICY_ENGINE
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_pe_invoke_callback: Invoking the
PE: query=1746, ref=pe_calc-dc-1340693058-1731, seq=3, quorate=1
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_config: On loss of CCM
Quorum: Ignore
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_rsc_op: Operation
ip2arp_last_failure_0 found resource ip2arp active on proxy1.example.com
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_rsc_op: Operation
ip1arp_last_failure_0 found resource ip1arp active on proxy1.example.com
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_rsc_op: Operation
ip3_last_failure_0 found resource ip3 active on proxy1.example.com
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_rsc_op: Operation
ip3arp_last_failure_0 found resource ip3arp active on proxy1.example.com
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave
email_alert#011(Stopped)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave
ip1#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave
ip1arp#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave
ip1email#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave
ip2#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave
ip2arp#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave
ip2email#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave
ip3#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave
ip3arp#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 crmd: [3205]: info: unpack_graph: Unpacked transition
1653: 0 actions in 0 synapses
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave
ip3email#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_te_invoke: Processing graph 1653
(ref=pe_calc-dc-1340693058-1731) derived from /var/lib/pengine/pe-input-35.bz2
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: process_pe_message: Transition
1653: PEngine Input stored in: /var/lib/pengine/pe-input-35.bz2
Jun 26 06:44:18 proxy1 crmd: [3205]: info: run_graph:
====================================================
Jun 26 06:44:18 proxy1 crmd: [3205]: notice: run_graph: Transition 1653
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pengine/pe-input-35.bz2): Complete
Jun 26 06:44:18 proxy1 crmd: [3205]: info: te_graph_trigger: Transition 1653 is
now complete
Jun 26 06:44:18 proxy1 crmd: [3205]: info: notify_crmd: Transition 1653 status:
done - <null>
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: Starting
PEngine Recheck Timer
How can I diagnose why they are not talking to each other?
Marcus
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems