On Mon, Jun 18, 2007 at 09:17:59PM +0200, KRAFT Benjamin wrote: > Hello everybody, > > I'm pretty new with heartbeat and i've got one problem. > The main objective of my tests were to be able of moving resources from > one node to one other easily, did it successfully with crm_resource. > > The main problem i'm experiencing is that it seems that when i fire the > first node which holds the IPaddr, the node 2 gets the ip BUT releases > it immediatly. > > The configuration : > > node 1 : sntest1.dclux.loc > * eth0 : 192.168.1.166/24 > * eth1 : 192.168.176.166/24 > > node 2 : sntest2.dclux.loc > * eth0 : 192.168.1.167/24 > * eth1 : 192.168.176.166/24 > > Resource : 192.168.1.168/24 as an alias of eth0. >
Some comments on your configuration (did you read comments in the ha.cf distributed with the heartbeat): > /etc/ha.d/ha.cf @sntest1 : > logfacility local0 > logfile /var/log/ha-log > debugfile /var/log/ha-debug It's preferred to use ha_logd instead of syslog or files. > keepalive 100ms This is way too small an interval. Go with one second, unless you really have a need for extra fast failovers. > deadtime 250ms The same. The deadtime should be several times longer than the keepalive. Otherwise you risk a split-brain. > warntime 500ms Warntime should be lower than deadtime. See the docs. > initdead 120 # depend on your hardware > udpport 694 > ping 192.168.176.167 You don't ping a node. You should ping a third party, preferably a host with long uptime and good connectivity. > bcast eth1 Always use more than one media for heartbeat. At least two network interfaces for example. > auto_failback off > node sntest1.dclux.loc > node sntest2.dclux.loc > #use_logd yes > compression bz2 > compression_threshold 2 > crm yes > > /etc/ha.d/ha.cf @sntest2 : > logfacility local0 > logfile /var/log/ha-log > debugfile /var/log/ha-debug > keepalive 100ms > deadtime 250ms > warntime 500ms > initdead 120 # depend on your hardware > udpport 694 > ping 192.168.176.166 > bcast eth1 > auto_failback off > node sntest1.dclux.loc > node sntest2.dclux.loc > #use_logd yes > compression bz2 > compression_threshold 2 > crm yes > > /var/lib/heartbeat/cib.xml : > <cib admin_epoch="0" node_fencing="yes" have_quorum="false" > ignore_dtd="false" num_peers="2" ccm_transition="3" generated="true" > dc_uuid="a3efd7cd-eca4-4ea5 > -9bba-417225edb077" epoch="2" cib_feature_revision="1.3" > num_updates="30" cib-last-written="Mon Jun 18 20:49:43 2007"> > <configuration> > <crm_config> > <cluster_property_set id="cib-bootstrap-options"> > <attributes> > <nvpair id="cib-bootstrap-options-symmetric-cluster" > name="symmetric-cluster" value="true"/> > <nvpair id="cib-bootstrap-options-no_quorum-policy" > name="no_quorum-policy" value="stop"/> > <nvpair > id="cib-bootstrap-options-default-resource-stickiness" > name="default-resource-stickiness" value="0"/> > <nvpair > id="cib-bootstrap-options-default-resource-failure-stickiness" > name="default-resource-failure-stickiness" value="0"/> > <nvpair id="cib-bootstrap-options-stonith-enabled" > name="stonith-enabled" value="false"/> > <nvpair id="cib-bootstrap-options-stonith-action" > name="stonith-action" value="reboot"/> > <nvpair id="cib-bootstrap-options-stop-orphan-resources" > name="stop-orphan-resources" value="true"/> > <nvpair id="cib-bootstrap-options-stop-orphan-actions" > name="stop-orphan-actions" value="true"/> > <nvpair id="cib-bootstrap-options-remove-after-stop" > name="remove-after-stop" value="false"/> > <nvpair id="cib-bootstrap-options-short-resource-names" > name="short-resource-names" value="true"/> > <nvpair id="cib-bootstrap-options-transition-idle-timeout" > name="transition-idle-timeout" value="5min"/> > <nvpair id="cib-bootstrap-options-default-action-timeout" > name="default-action-timeout" value="5s"/> > <nvpair id="cib-bootstrap-options-is-managed-default" > name="is-managed-default" value="true"/> > </attributes> > </cluster_property_set> > </crm_config> > <nodes> > <node id="db1c3a49-0f90-424a-8f09-c85af7ed83a8" > uname="sntest2.dclux.loc" type="normal"/> > <node id="a3efd7cd-eca4-4ea5-9bba-417225edb077" > uname="sntest1.dclux.loc" type="normal"/> > </nodes> > <resources> > <primitive class="ocf" id="IPaddr_192_168_1_168" > provider="heartbeat" type="IPaddr" failstop_type="stonith"> > <operations> > <op id="IPaddr_192_168_1_168_mon" interval="5s" > name="monitor" timeout="5s"/> > </operations> > <instance_attributes id="IPaddr_192_168_1_168_inst_attr"> > <attributes> > <nvpair id="IPaddr_192_168_1_168_attr_0" name="ip" > value="192.168.1.168"/> > <nvpair id="IPaddr_192_168_1_168_attr_1" name="netmask" > value="24"/> > <nvpair id="IPaddr_192_168_1_168_attr_2" name="nic" > value="eth0"/> > </attributes> > </instance_attributes> > </primitive> > </resources> > <constraints> > <rsc_location id="rsc_location_IPaddr_192_168_1_168" > rsc="IPaddr_192_168_1_168"> > <rule id="prefered_location_IPaddr_192_168_1_168" score="100"> > <expression attribute="#uname" > id="prefered_location_IPaddr_192_168_1_168_expr" operation="eq" > value="sntest1.dclux.loc"/> > </rule> > </rsc_location> > <rsc_location rsc="IPaddr_192_168_1_168" > id="cli-prefer-IPaddr_192_168_1_168"> > <rule score="INFINITY" id="cli-prefer-rule-IPaddr_192_168_1_168"> > <expression attribute="#uname" operation="eq" type="string" > id="cli-prefer-expr-IPaddr_192_168_1_168" value="sntest1.dclux.loc"/> > </rule> > </rsc_location> > </constraints> > </configuration> > </cib> There's one constraint too many. You should leave out the second one. > **** Time frame of the test **** > > * heartbeat running, resource IPaddr_192_168_1_168 is on sntest1. > * i do by hand : "ifconfig eth1 down" on sntest1 That's split-brain situation. You don't want that to ever happen. Ever. > here are the logs on sntest2 after this : > > heartbeat[3457]: 2007/06/18_20:49:44 WARN: node 192.168.176.166: is dead > heartbeat[3457]: 2007/06/18_20:49:44 WARN: node sntest1.dclux.loc: is dead > crmd[3472]: 2007/06/18_20:49:44 notice: crmd_ha_status_callback: Status > update: Node 192.168.176.166 now has status [dead] > heartbeat[3457]: 2007/06/18_20:49:44 info: Link > 192.168.176.166:192.168.176.166 dead. > heartbeat[3457]: 2007/06/18_20:49:44 info: Link sntest1.dclux.loc:eth1 dead. > crmd[3472]: 2007/06/18_20:49:44 WARN: get_uuid: Could not calculate UUID > for 192.168.176.166 > crmd[3472]: 2007/06/18_20:49:44 info: crmd_ha_status_callback: Ping node > 192.168.176.166 is dead > crmd[3472]: 2007/06/18_20:49:44 notice: crmd_ha_status_callback: Status > update: Node sntest1.dclux.loc now has status [dead] > cib[3468]: 2007/06/18_20:49:44 info: cib_diff_notify: Local-only Change > (client:3472, call: 34): 0.1.24 (ok) > cib[3693]: 2007/06/18_20:49:44 info: write_cib_contents: Wrote version > 0.1.24 of the CIB to disk (digest: 4e309facbe71d928e0ec8b6e52a151f0) > tengine[3484]: 2007/06/18_20:49:44 info: te_update_diff: Processing diff > (cib_update): 0.1.24 -> 0.1.24 > tengine[3484]: 2007/06/18_20:49:44 WARN: match_down_event: No match for > shutdown action on a3efd7cd-eca4-4ea5-9bba-417225edb077 > tengine[3484]: 2007/06/18_20:49:44 info: extract_event: Stonith/shutdown > of a3efd7cd-eca4-4ea5-9bba-417225edb077 not matched > tengine[3484]: 2007/06/18_20:49:44 info: update_abort_priority: Abort > priority upgraded to 1000000 > tengine[3484]: 2007/06/18_20:49:44 info: te_update_diff: Aborting on > transient_attributes deletions > crmd[3472]: 2007/06/18_20:49:44 info: do_state_transition: > sntest2.dclux.loc: State transition S_IDLE -> S_POLICY_ENGINE [ > input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ] > crmd[3472]: 2007/06/18_20:49:44 info: do_state_transition: All 2 cluster > nodes are eligable to run resources. > pengine[3485]: 2007/06/18_20:49:44 info: log_data_element: > process_pe_message: [generation] <cib admin_epoch="0" epoch="1" > num_updates="24" node_fencing="yes" generated="true" have_quorum="true" > ignore_dtd="false" num_peers="2" ccm_transition="2" > cib_feature_revision="1.3" dc_uuid="db1c3a49-0f90-424a-8f09-c85af7ed83a8"/> > pengine[3485]: 2007/06/18_20:49:44 notice: cluster_option: Using default > value 'stop' for cluster option 'no-quorum-policy' > pengine[3485]: 2007/06/18_20:49:44 notice: cluster_option: Using default > value '60s' for cluster option 'cluster-delay' > pengine[3485]: 2007/06/18_20:49:44 notice: cluster_option: Using default > value '-1' for cluster option 'pe-error-series-max' > pengine[3485]: 2007/06/18_20:49:44 notice: cluster_option: Using default > value '-1' for cluster option 'pe-warn-series-max' > pengine[3485]: 2007/06/18_20:49:44 notice: cluster_option: Using default > value '-1' for cluster option 'pe-input-series-max' > pengine[3485]: 2007/06/18_20:49:44 notice: cluster_option: Using default > value 'true' for cluster option 'startup-fencing' > pengine[3485]: 2007/06/18_20:49:44 info: determine_online_status: Node > sntest2.dclux.loc is online > pengine[3485]: 2007/06/18_20:49:44 info: native_print: > IPaddr_192_168_1_168 (heartbeat::ocf:IPaddr): Stopped > pengine[3485]: 2007/06/18_20:49:44 notice: StartRsc: sntest2.dclux.loc > Start IPaddr_192_168_1_168 > pengine[3485]: 2007/06/18_20:49:44 notice: Recurring: > sntest2.dclux.loc IPaddr_192_168_1_168_monitor_5000 > crmd[3472]: 2007/06/18_20:49:44 info: do_state_transition: > sntest2.dclux.loc: State transition S_POLICY_ENGINE -> > S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE > origin=route_message ] > tengine[3484]: 2007/06/18_20:49:44 info: unpack_graph: Unpacked > transition 5: 2 actions in 2 synapses > tengine[3484]: 2007/06/18_20:49:44 info: send_rsc_command: Initiating > action 3: IPaddr_192_168_1_168_start_0 on sntest2.dclux.loc > crmd[3472]: 2007/06/18_20:49:44 info: do_lrm_rsc_op: Performing > op=IPaddr_192_168_1_168_start_0 > key=3:5:be5645bb-9fc6-4331-9bce-6cad3a7d88f1) > pengine[3485]: 2007/06/18_20:49:44 info: process_pe_message: Transition > 5: PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-149.bz2 > IPaddr[3694]: 2007/06/18_20:49:44 INFO: Using calculated netmask for > 192.168.1.168: 255.255.255.0 > IPaddr[3694]: 2007/06/18_20:49:44 DEBUG: Using calculated broadcast > for 192.168.1.168: 192.168.1.255 > IPaddr[3694]: 2007/06/18_20:49:44 INFO: eval /sbin/ifconfig eth0:0 > 192.168.1.168 netmask 255.255.255.0 broadcast 192.168.1.255 > IPaddr[3694]: 2007/06/18_20:49:44 DEBUG: Sending Gratuitous Arp for > 192.168.1.168 on eth0:0 [eth0] > crmd[3472]: 2007/06/18_20:49:44 info: process_lrm_event: LRM operation > IPaddr_192_168_1_168_start_0 (call=7, rc=0) complete > crmd[3472]: 2007/06/18_20:49:44 info: append_restart_list: Resource > IPaddr_192_168_1_168 does not support reloads > cib[3468]: 2007/06/18_20:49:44 info: cib_diff_notify: Update (client: > 3472, call:37): 0.1.24 -> 0.1.25 (ok) > tengine[3484]: 2007/06/18_20:49:44 info: te_update_diff: Processing diff > (cib_update): 0.1.24 -> 0.1.25 > tengine[3484]: 2007/06/18_20:49:44 info: match_graph_event: Action > IPaddr_192_168_1_168_start_0 (3) confirmed on > db1c3a49-0f90-424a-8f09-c85af7ed83a8 > tengine[3484]: 2007/06/18_20:49:44 info: send_rsc_command: Initiating > action 4: IPaddr_192_168_1_168_monitor_5000 on sntest2.dclux.loc > crmd[3472]: 2007/06/18_20:49:44 info: do_lrm_rsc_op: Performing > op=IPaddr_192_168_1_168_monitor_5000 > key=4:5:be5645bb-9fc6-4331-9bce-6cad3a7d88f1) > cib[3763]: 2007/06/18_20:49:44 info: write_cib_contents: Wrote version > 0.1.25 of the CIB to disk (digest: c564f9c3ff4c6ebfa2da07a1487e3195) > crmd[3472]: 2007/06/18_20:49:44 info: process_lrm_event: LRM operation > IPaddr_192_168_1_168_monitor_5000 (call=8, rc=0) complete > cib[3468]: 2007/06/18_20:49:44 info: cib_diff_notify: Update (client: > 3472, call:38): 0.1.25 -> 0.1.26 (ok) > tengine[3484]: 2007/06/18_20:49:44 info: te_update_diff: Processing diff > (cib_update): 0.1.25 -> 0.1.26 > tengine[3484]: 2007/06/18_20:49:44 info: match_graph_event: Action > IPaddr_192_168_1_168_monitor_5000 (4) confirmed on > db1c3a49-0f90-424a-8f09-c85af7ed83a8 > tengine[3484]: 2007/06/18_20:49:44 info: run_graph: Transition 5: > (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0) > tengine[3484]: 2007/06/18_20:49:44 info: notify_crmd: Transition 5 > status: te_complete - <null> > crmd[3472]: 2007/06/18_20:49:44 info: do_state_transition: > sntest2.dclux.loc: State transition S_TRANSITION_ENGINE -> S_IDLE [ > input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ] > cib[3778]: 2007/06/18_20:49:44 info: write_cib_contents: Wrote version > 0.1.26 of the CIB to disk (digest: b4d9fdb02151cfe83f146108fab56b8c) > cib[3468]: 2007/06/18_20:49:46 info: mem_handle_event: Got an event > OC_EV_MS_INVALID from ccm > crmd[3472]: 2007/06/18_20:49:46 info: mem_handle_event: Got an event > OC_EV_MS_INVALID from ccm > cib[3468]: 2007/06/18_20:49:46 info: mem_handle_event: no mbr_track info > cib[3468]: 2007/06/18_20:49:46 info: mem_handle_event: Got an event > OC_EV_MS_INVALID from ccm > crmd[3472]: 2007/06/18_20:49:46 info: mem_handle_event: no mbr_track info > cib[3468]: 2007/06/18_20:49:46 info: mem_handle_event: instance=3, > nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=4 > cib[3468]: 2007/06/18_20:49:46 info: cib_ccm_msg_callback: LOST: > sntest1.dclux.loc > cib[3468]: 2007/06/18_20:49:46 info: cib_ccm_msg_callback: PEER: > sntest2.dclux.loc > crmd[3472]: 2007/06/18_20:49:46 info: mem_handle_event: Got an event > OC_EV_MS_INVALID from ccm > crmd[3472]: 2007/06/18_20:49:46 info: mem_handle_event: instance=3, > nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=4 > crmd[3472]: 2007/06/18_20:49:46 info: crmd_ccm_msg_callback: Quorum lost > after event=INVALID (id=3) > crmd[3472]: 2007/06/18_20:49:46 info: crmd_ccm_msg_callback: Quorum > lost: triggering transition (INVALID) > crmd[3472]: 2007/06/18_20:49:46 info: ccm_event_detail: INVALID: > trans=3, nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=4 > crmd[3472]: 2007/06/18_20:49:46 info: ccm_event_detail: CURRENT: > sntest2.dclux.loc [nodeid=2, born=3] > cib[3468]: 2007/06/18_20:49:46 info: cib_diff_notify: Local-only Change > (client:3472, call: 39): 0.1.26 (ok) > crmd[3472]: 2007/06/18_20:49:46 info: ccm_event_detail: LOST: > sntest1.dclux.loc [nodeid=1, born=2] > tengine[3484]: 2007/06/18_20:49:46 info: update_abort_priority: Abort > priority upgraded to 1000000 > tengine[3484]: 2007/06/18_20:49:46 info: te_update_diff: Processing diff > (cib_update): 0.1.26 -> 0.1.26 > crmd[3472]: 2007/06/18_20:49:46 info: do_state_transition: > sntest2.dclux.loc: State transition S_IDLE -> S_POLICY_ENGINE [ > input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ] > crmd[3472]: 2007/06/18_20:49:46 info: do_state_transition: All 1 cluster > nodes are eligable to run resources. > cib[3779]: 2007/06/18_20:49:46 info: write_cib_contents: Wrote version > 0.1.26 of the CIB to disk (digest: bca029655a297d98a51ac667259e2a29) > pengine[3485]: 2007/06/18_20:49:46 info: log_data_element: > process_pe_message: [generation] <cib admin_epoch="0" epoch="1" > num_updates="26" node_fencing="yes" generated="true" have_quorum="false" > ignore_dtd="false" num_peers="2" ccm_transition="3" > cib_feature_revision="1.3" dc_uuid="db1c3a49-0f90-424a-8f09-c85af7ed83a8"/> > pengine[3485]: 2007/06/18_20:49:46 notice: cluster_option: Using default > value 'stop' for cluster option 'no-quorum-policy' > pengine[3485]: 2007/06/18_20:49:46 notice: cluster_option: Using default > value '60s' for cluster option 'cluster-delay' > pengine[3485]: 2007/06/18_20:49:46 notice: cluster_option: Using default > value '-1' for cluster option 'pe-error-series-max' > pengine[3485]: 2007/06/18_20:49:46 notice: cluster_option: Using default > value '-1' for cluster option 'pe-warn-series-max' > pengine[3485]: 2007/06/18_20:49:46 notice: cluster_option: Using default > value '-1' for cluster option 'pe-input-series-max' > pengine[3485]: 2007/06/18_20:49:46 notice: cluster_option: Using default > value 'true' for cluster option 'startup-fencing' > pengine[3485]: 2007/06/18_20:49:46 WARN: cluster_status: We do not have > quorum - fencing and resource management disabled > pengine[3485]: 2007/06/18_20:49:46 info: determine_online_status: Node > sntest2.dclux.loc is online > pengine[3485]: 2007/06/18_20:49:46 info: native_print: > IPaddr_192_168_1_168 (heartbeat::ocf:IPaddr): Started > sntest2.dclux.loc > pengine[3485]: 2007/06/18_20:49:46 notice: StopRsc: sntest2.dclux.loc > Stop IPaddr_192_168_1_168 > crmd[3472]: 2007/06/18_20:49:46 info: do_state_transition: > sntest2.dclux.loc: State transition S_POLICY_ENGINE -> > S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE > origin=route_message ] > tengine[3484]: 2007/06/18_20:49:46 info: unpack_graph: Unpacked > transition 6: 1 actions in 1 synapses > tengine[3484]: 2007/06/18_20:49:46 info: send_rsc_command: Initiating > action 4: IPaddr_192_168_1_168_stop_0 on sntest2.dclux.loc > crmd[3472]: 2007/06/18_20:49:46 info: do_lrm_rsc_op: Performing > op=IPaddr_192_168_1_168_stop_0 key=4:6:be5645bb-9fc6-4331-9bce-6cad3a7d88f1) > pengine[3485]: 2007/06/18_20:49:46 info: process_pe_message: Transition > 6: PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-150.bz2 > crmd[3472]: 2007/06/18_20:49:46 WARN: process_lrm_event: LRM operation > IPaddr_192_168_1_168_monitor_5000 (call=8, rc=-2) Cancelled > lrmd[3469]: 2007/06/18_20:49:46 info: RA output: > (IPaddr_192_168_1_168:stop:stderr) SIOCDELRT: No such process > > IPaddr[3780]: 2007/06/18_20:49:46 INFO: /sbin/ifconfig eth0:0 > 192.168.1.168 down > crmd[3472]: 2007/06/18_20:49:46 info: process_lrm_event: LRM operation > IPaddr_192_168_1_168_stop_0 (call=10, rc=0) complete > cib[3468]: 2007/06/18_20:49:46 info: cib_diff_notify: Update (client: > 3472, call:41): 0.1.26 -> 0.1.27 (ok) > tengine[3484]: 2007/06/18_20:49:46 info: te_update_diff: Processing diff > (cib_update): 0.1.26 -> 0.1.27 > tengine[3484]: 2007/06/18_20:49:46 info: match_graph_event: Action > IPaddr_192_168_1_168_stop_0 (4) confirmed on > db1c3a49-0f90-424a-8f09-c85af7ed83a8 > tengine[3484]: 2007/06/18_20:49:46 info: run_graph: Transition 6: > (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0) > tengine[3484]: 2007/06/18_20:49:46 info: notify_crmd: Transition 6 > status: te_complete - <null> > crmd[3472]: 2007/06/18_20:49:46 info: do_state_transition: > sntest2.dclux.loc: State transition S_TRANSITION_ENGINE -> S_IDLE [ > input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ] > cib[3800]: 2007/06/18_20:49:46 info: write_cib_contents: Wrote version > 0.1.27 of the CIB to disk (digest: d622380a2238356b807686a901cd1b6c) > > and then ... nothing. > sntest2 doesn't get the resource. Your "surviving" node, sntest2, looses connectivity (the ping node is dead). It also looses the quorum, probably because of a loss of connectivity (I can't recall exactly, but that should be the case). Basically, you create a situation where neither node can run a resourse. > What am I missing ? Try to figure out how would you behave in such a situation. Try to mimick a cluster. There's no magic in heartbeat or any other peace of software, I'm afraid, which could make the situation you produced workable. I'd like to suggest to go on and read some documentation on clusters in general and then some on heartbeat. > > Thanks in advance, > > Benjamin > _______________________________________________ > Linux-HA mailing list > [email protected] > http://lists.linux-ha.org/mailman/listinfo/linux-ha > See also: http://linux-ha.org/ReportingProblems -- Dejan _______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
