[Linux-HA] IPaddr failover problem

KRAFT Benjamin Mon, 18 Jun 2007 12:43:05 -0700

Hello everybody,

I'm pretty new to heartbeat and I've got one problem.

The main objective of my tests was to be able to move resources from one node to another easily, which I did successfully with crm_resource.

The main problem I'm experiencing is that when I take down the first node, which holds the IPaddr resource, node 2 gets the IP but releases it immediately.


The configuration :

node 1 : sntest1.dclux.loc
* eth0 : 192.168.1.166/24
* eth1 : 192.168.176.166/24

node 2 : sntest2.dclux.loc
* eth0 : 192.168.1.167/24
* eth1 : 192.168.176.166/24

Resource : 192.168.1.168/24 as an alias of eth0.


/etc/ha.d/ha.cf @sntest1 :
logfacility    local0
logfile        /var/log/ha-log
debugfile /var/log/ha-debug
keepalive 100ms
deadtime 250ms
warntime 500ms
initdead 120 # depend on your hardware
udpport 694
ping 192.168.176.167
bcast eth1
auto_failback off
node    sntest1.dclux.loc
node    sntest2.dclux.loc
#use_logd yes
compression     bz2
compression_threshold 2
crm yes

/etc/ha.d/ha.cf @sntest2 :
logfacility    local0
logfile        /var/log/ha-log
debugfile /var/log/ha-debug
keepalive 100ms
deadtime 250ms
warntime 500ms
initdead 120 # depend on your hardware
udpport 694
ping 192.168.176.166
bcast eth1
auto_failback off
node    sntest1.dclux.loc
node    sntest2.dclux.loc
#use_logd yes
compression     bz2
compression_threshold 2
crm yes

/var/lib/heartbeat/cib.xml :

<cib admin_epoch="0" node_fencing="yes" have_quorum="false" ignore_dtd="false" num_peers="2" ccm_transition="3" generated="true" dc_uuid="a3efd7cd-eca4-4ea5-9bba-417225edb077" epoch="2" cib_feature_revision="1.3" num_updates="30" cib-last-written="Mon Jun 18 20:49:43 2007">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <attributes>
          <nvpair id="cib-bootstrap-options-symmetric-cluster" name="symmetric-cluster" value="true"/>
          <nvpair id="cib-bootstrap-options-no_quorum-policy" name="no_quorum-policy" value="stop"/>
          <nvpair id="cib-bootstrap-options-default-resource-stickiness" name="default-resource-stickiness" value="0"/>
          <nvpair id="cib-bootstrap-options-default-resource-failure-stickiness" name="default-resource-failure-stickiness" value="0"/>
          <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
          <nvpair id="cib-bootstrap-options-stonith-action" name="stonith-action" value="reboot"/>
          <nvpair id="cib-bootstrap-options-stop-orphan-resources" name="stop-orphan-resources" value="true"/>
          <nvpair id="cib-bootstrap-options-stop-orphan-actions" name="stop-orphan-actions" value="true"/>
          <nvpair id="cib-bootstrap-options-remove-after-stop" name="remove-after-stop" value="false"/>
          <nvpair id="cib-bootstrap-options-short-resource-names" name="short-resource-names" value="true"/>
          <nvpair id="cib-bootstrap-options-transition-idle-timeout" name="transition-idle-timeout" value="5min"/>
          <nvpair id="cib-bootstrap-options-default-action-timeout" name="default-action-timeout" value="5s"/>
          <nvpair id="cib-bootstrap-options-is-managed-default" name="is-managed-default" value="true"/>
        </attributes>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="db1c3a49-0f90-424a-8f09-c85af7ed83a8" uname="sntest2.dclux.loc" type="normal"/>
      <node id="a3efd7cd-eca4-4ea5-9bba-417225edb077" uname="sntest1.dclux.loc" type="normal"/>
    </nodes>
    <resources>
      <primitive class="ocf" id="IPaddr_192_168_1_168" provider="heartbeat" type="IPaddr" failstop_type="stonith">
        <operations>
          <op id="IPaddr_192_168_1_168_mon" interval="5s" name="monitor" timeout="5s"/>
        </operations>
        <instance_attributes id="IPaddr_192_168_1_168_inst_attr">
          <attributes>
            <nvpair id="IPaddr_192_168_1_168_attr_0" name="ip" value="192.168.1.168"/>
            <nvpair id="IPaddr_192_168_1_168_attr_1" name="netmask" value="24"/>
            <nvpair id="IPaddr_192_168_1_168_attr_2" name="nic" value="eth0"/>
          </attributes>
        </instance_attributes>
      </primitive>
    </resources>
    <constraints>
      <rsc_location id="rsc_location_IPaddr_192_168_1_168" rsc="IPaddr_192_168_1_168">
        <rule id="prefered_location_IPaddr_192_168_1_168" score="100">
          <expression attribute="#uname" id="prefered_location_IPaddr_192_168_1_168_expr" operation="eq" value="sntest1.dclux.loc"/>
        </rule>
      </rsc_location>
      <rsc_location rsc="IPaddr_192_168_1_168" id="cli-prefer-IPaddr_192_168_1_168">
        <rule score="INFINITY" id="cli-prefer-rule-IPaddr_192_168_1_168">
          <expression attribute="#uname" operation="eq" type="string" id="cli-prefer-expr-IPaddr_192_168_1_168" value="sntest1.dclux.loc"/>
        </rule>
      </rsc_location>
    </constraints>
  </configuration>
</cib>
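
For anyone who wants a condensed view of this CIB, here is a minimal Python sketch (not part of heartbeat) that lists the cluster options and the location rules. The embedded XML is an abbreviated copy of the cib.xml above with only two options kept, and the name summarize_cib is hypothetical, chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the cib.xml above (assumption: only two of the
# cluster options are kept here to keep the example short).
CIB_XML = """
<cib admin_epoch="0" have_quorum="false" num_peers="2">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <attributes>
          <nvpair id="cib-bootstrap-options-no_quorum-policy" name="no_quorum-policy" value="stop"/>
          <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
        </attributes>
      </cluster_property_set>
    </crm_config>
    <constraints>
      <rsc_location id="rsc_location_IPaddr_192_168_1_168" rsc="IPaddr_192_168_1_168">
        <rule id="prefered_location_IPaddr_192_168_1_168" score="100">
          <expression attribute="#uname" id="prefered_location_IPaddr_192_168_1_168_expr" operation="eq" value="sntest1.dclux.loc"/>
        </rule>
      </rsc_location>
      <rsc_location rsc="IPaddr_192_168_1_168" id="cli-prefer-IPaddr_192_168_1_168">
        <rule score="INFINITY" id="cli-prefer-rule-IPaddr_192_168_1_168">
          <expression attribute="#uname" operation="eq" type="string" id="cli-prefer-expr-IPaddr_192_168_1_168" value="sntest1.dclux.loc"/>
        </rule>
      </rsc_location>
    </constraints>
  </configuration>
</cib>
"""


def summarize_cib(xml_text):
    """Return (options, rules) extracted from a CIB document.

    options: dict mapping cluster option name to value (crm_config only)
    rules:   list of (resource, score, attribute, operation, value) tuples
    """
    root = ET.fromstring(xml_text)

    # Collect only the nvpairs under crm_config, not resource attributes.
    options = {}
    crm_config = root.find("configuration/crm_config")
    if crm_config is not None:
        for nvpair in crm_config.iter("nvpair"):
            options[nvpair.get("name")] = nvpair.get("value")

    # Flatten every location rule expression into one tuple.
    rules = []
    for location in root.iter("rsc_location"):
        for rule in location.findall("rule"):
            for expr in rule.findall("expression"):
                rules.append((
                    location.get("rsc"),
                    rule.get("score"),
                    expr.get("attribute"),
                    expr.get("operation"),
                    expr.get("value"),
                ))
    return options, rules


if __name__ == "__main__":
    options, rules = summarize_cib(CIB_XML)
    for name, value in sorted(options.items()):
        print("option %s = %s" % (name, value))
    for rule in rules:
        print("location %s: score=%s when %s %s %s" % rule)
```

To run it against the real file, the contents of /var/lib/heartbeat/cib.xml could be read and passed to summarize_cib instead of the embedded string.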


**** Time frame of the test ****

* heartbeat is running; resource IPaddr_192_168_1_168 is on sntest1.
* I run "ifconfig eth1 down" by hand on sntest1.

Here are the logs on sntest2 after this:

heartbeat[3457]: 2007/06/18_20:49:44 WARN: node 192.168.176.166: is dead
heartbeat[3457]: 2007/06/18_20:49:44 WARN: node sntest1.dclux.loc: is dead
crmd[3472]: 2007/06/18_20:49:44 notice: crmd_ha_status_callback: Status update: Node 192.168.176.166 now has status [dead]
heartbeat[3457]: 2007/06/18_20:49:44 info: Link 192.168.176.166:192.168.176.166 dead.
heartbeat[3457]: 2007/06/18_20:49:44 info: Link sntest1.dclux.loc:eth1 dead.
crmd[3472]: 2007/06/18_20:49:44 WARN: get_uuid: Could not calculate UUID for 192.168.176.166
crmd[3472]: 2007/06/18_20:49:44 info: crmd_ha_status_callback: Ping node 192.168.176.166 is dead
crmd[3472]: 2007/06/18_20:49:44 notice: crmd_ha_status_callback: Status update: Node sntest1.dclux.loc now has status [dead]
cib[3468]: 2007/06/18_20:49:44 info: cib_diff_notify: Local-only Change (client:3472, call: 34): 0.1.24 (ok)
cib[3693]: 2007/06/18_20:49:44 info: write_cib_contents: Wrote version 0.1.24 of the CIB to disk (digest: 4e309facbe71d928e0ec8b6e52a151f0)
tengine[3484]: 2007/06/18_20:49:44 info: te_update_diff: Processing diff (cib_update): 0.1.24 -> 0.1.24
tengine[3484]: 2007/06/18_20:49:44 WARN: match_down_event: No match for shutdown action on a3efd7cd-eca4-4ea5-9bba-417225edb077
tengine[3484]: 2007/06/18_20:49:44 info: extract_event: Stonith/shutdown of a3efd7cd-eca4-4ea5-9bba-417225edb077 not matched
tengine[3484]: 2007/06/18_20:49:44 info: update_abort_priority: Abort priority upgraded to 1000000
tengine[3484]: 2007/06/18_20:49:44 info: te_update_diff: Aborting on transient_attributes deletions
crmd[3472]: 2007/06/18_20:49:44 info: do_state_transition: sntest2.dclux.loc: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
crmd[3472]: 2007/06/18_20:49:44 info: do_state_transition: All 2 cluster nodes are eligable to run resources.
pengine[3485]: 2007/06/18_20:49:44 info: log_data_element: process_pe_message: [generation] <cib admin_epoch="0" epoch="1" num_updates="24" node_fencing="yes" generated="true" have_quorum="true" ignore_dtd="false" num_peers="2" ccm_transition="2" cib_feature_revision="1.3" dc_uuid="db1c3a49-0f90-424a-8f09-c85af7ed83a8"/>
pengine[3485]: 2007/06/18_20:49:44 notice: cluster_option: Using default value 'stop' for cluster option 'no-quorum-policy'
pengine[3485]: 2007/06/18_20:49:44 notice: cluster_option: Using default value '60s' for cluster option 'cluster-delay'
pengine[3485]: 2007/06/18_20:49:44 notice: cluster_option: Using default value '-1' for cluster option 'pe-error-series-max'
pengine[3485]: 2007/06/18_20:49:44 notice: cluster_option: Using default value '-1' for cluster option 'pe-warn-series-max'
pengine[3485]: 2007/06/18_20:49:44 notice: cluster_option: Using default value '-1' for cluster option 'pe-input-series-max'
pengine[3485]: 2007/06/18_20:49:44 notice: cluster_option: Using default value 'true' for cluster option 'startup-fencing'
pengine[3485]: 2007/06/18_20:49:44 info: determine_online_status: Node sntest2.dclux.loc is online
pengine[3485]: 2007/06/18_20:49:44 info: native_print: IPaddr_192_168_1_168 (heartbeat::ocf:IPaddr): Stopped
pengine[3485]: 2007/06/18_20:49:44 notice: StartRsc: sntest2.dclux.loc Start IPaddr_192_168_1_168
pengine[3485]: 2007/06/18_20:49:44 notice: Recurring: sntest2.dclux.loc IPaddr_192_168_1_168_monitor_5000
crmd[3472]: 2007/06/18_20:49:44 info: do_state_transition: sntest2.dclux.loc: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
tengine[3484]: 2007/06/18_20:49:44 info: unpack_graph: Unpacked transition 5: 2 actions in 2 synapses
tengine[3484]: 2007/06/18_20:49:44 info: send_rsc_command: Initiating action 3: IPaddr_192_168_1_168_start_0 on sntest2.dclux.loc
crmd[3472]: 2007/06/18_20:49:44 info: do_lrm_rsc_op: Performing op=IPaddr_192_168_1_168_start_0 key=3:5:be5645bb-9fc6-4331-9bce-6cad3a7d88f1)
pengine[3485]: 2007/06/18_20:49:44 info: process_pe_message: Transition 5: PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-149.bz2
IPaddr[3694]: 2007/06/18_20:49:44 INFO: Using calculated netmask for 192.168.1.168: 255.255.255.0
IPaddr[3694]: 2007/06/18_20:49:44 DEBUG: Using calculated broadcast for 192.168.1.168: 192.168.1.255
IPaddr[3694]: 2007/06/18_20:49:44 INFO: eval /sbin/ifconfig eth0:0 192.168.1.168 netmask 255.255.255.0 broadcast 192.168.1.255
IPaddr[3694]: 2007/06/18_20:49:44 DEBUG: Sending Gratuitous Arp for 192.168.1.168 on eth0:0 [eth0]
crmd[3472]: 2007/06/18_20:49:44 info: process_lrm_event: LRM operation IPaddr_192_168_1_168_start_0 (call=7, rc=0) complete
crmd[3472]: 2007/06/18_20:49:44 info: append_restart_list: Resource IPaddr_192_168_1_168 does not support reloads
cib[3468]: 2007/06/18_20:49:44 info: cib_diff_notify: Update (client:3472, call:37): 0.1.24 -> 0.1.25 (ok)
tengine[3484]: 2007/06/18_20:49:44 info: te_update_diff: Processing diff (cib_update): 0.1.24 -> 0.1.25
tengine[3484]: 2007/06/18_20:49:44 info: match_graph_event: Action IPaddr_192_168_1_168_start_0 (3) confirmed on db1c3a49-0f90-424a-8f09-c85af7ed83a8
tengine[3484]: 2007/06/18_20:49:44 info: send_rsc_command: Initiating action 4: IPaddr_192_168_1_168_monitor_5000 on sntest2.dclux.loc
crmd[3472]: 2007/06/18_20:49:44 info: do_lrm_rsc_op: Performing op=IPaddr_192_168_1_168_monitor_5000 key=4:5:be5645bb-9fc6-4331-9bce-6cad3a7d88f1)
cib[3763]: 2007/06/18_20:49:44 info: write_cib_contents: Wrote version 0.1.25 of the CIB to disk (digest: c564f9c3ff4c6ebfa2da07a1487e3195)
crmd[3472]: 2007/06/18_20:49:44 info: process_lrm_event: LRM operation IPaddr_192_168_1_168_monitor_5000 (call=8, rc=0) complete
cib[3468]: 2007/06/18_20:49:44 info: cib_diff_notify: Update (client:3472, call:38): 0.1.25 -> 0.1.26 (ok)
tengine[3484]: 2007/06/18_20:49:44 info: te_update_diff: Processing diff (cib_update): 0.1.25 -> 0.1.26
tengine[3484]: 2007/06/18_20:49:44 info: match_graph_event: Action IPaddr_192_168_1_168_monitor_5000 (4) confirmed on db1c3a49-0f90-424a-8f09-c85af7ed83a8
tengine[3484]: 2007/06/18_20:49:44 info: run_graph: Transition 5: (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0)
tengine[3484]: 2007/06/18_20:49:44 info: notify_crmd: Transition 5 status: te_complete - <null>
crmd[3472]: 2007/06/18_20:49:44 info: do_state_transition: sntest2.dclux.loc: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
cib[3778]: 2007/06/18_20:49:44 info: write_cib_contents: Wrote version 0.1.26 of the CIB to disk (digest: b4d9fdb02151cfe83f146108fab56b8c)
cib[3468]: 2007/06/18_20:49:46 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
crmd[3472]: 2007/06/18_20:49:46 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
cib[3468]: 2007/06/18_20:49:46 info: mem_handle_event: no mbr_track info
cib[3468]: 2007/06/18_20:49:46 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
crmd[3472]: 2007/06/18_20:49:46 info: mem_handle_event: no mbr_track info
cib[3468]: 2007/06/18_20:49:46 info: mem_handle_event: instance=3, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=4
cib[3468]: 2007/06/18_20:49:46 info: cib_ccm_msg_callback: LOST: sntest1.dclux.loc
cib[3468]: 2007/06/18_20:49:46 info: cib_ccm_msg_callback: PEER: sntest2.dclux.loc
crmd[3472]: 2007/06/18_20:49:46 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
crmd[3472]: 2007/06/18_20:49:46 info: mem_handle_event: instance=3, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=4
crmd[3472]: 2007/06/18_20:49:46 info: crmd_ccm_msg_callback: Quorum lost after event=INVALID (id=3)
crmd[3472]: 2007/06/18_20:49:46 info: crmd_ccm_msg_callback: Quorum lost: triggering transition (INVALID)
crmd[3472]: 2007/06/18_20:49:46 info: ccm_event_detail: INVALID: trans=3, nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=4
crmd[3472]: 2007/06/18_20:49:46 info: ccm_event_detail: CURRENT: sntest2.dclux.loc [nodeid=2, born=3]
cib[3468]: 2007/06/18_20:49:46 info: cib_diff_notify: Local-only Change (client:3472, call: 39): 0.1.26 (ok)
crmd[3472]: 2007/06/18_20:49:46 info: ccm_event_detail: LOST: sntest1.dclux.loc [nodeid=1, born=2]
tengine[3484]: 2007/06/18_20:49:46 info: update_abort_priority: Abort priority upgraded to 1000000
tengine[3484]: 2007/06/18_20:49:46 info: te_update_diff: Processing diff (cib_update): 0.1.26 -> 0.1.26
crmd[3472]: 2007/06/18_20:49:46 info: do_state_transition: sntest2.dclux.loc: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
crmd[3472]: 2007/06/18_20:49:46 info: do_state_transition: All 1 cluster nodes are eligable to run resources.
cib[3779]: 2007/06/18_20:49:46 info: write_cib_contents: Wrote version 0.1.26 of the CIB to disk (digest: bca029655a297d98a51ac667259e2a29)
pengine[3485]: 2007/06/18_20:49:46 info: log_data_element: process_pe_message: [generation] <cib admin_epoch="0" epoch="1" num_updates="26" node_fencing="yes" generated="true" have_quorum="false" ignore_dtd="false" num_peers="2" ccm_transition="3" cib_feature_revision="1.3" dc_uuid="db1c3a49-0f90-424a-8f09-c85af7ed83a8"/>
pengine[3485]: 2007/06/18_20:49:46 notice: cluster_option: Using default value 'stop' for cluster option 'no-quorum-policy'
pengine[3485]: 2007/06/18_20:49:46 notice: cluster_option: Using default value '60s' for cluster option 'cluster-delay'
pengine[3485]: 2007/06/18_20:49:46 notice: cluster_option: Using default value '-1' for cluster option 'pe-error-series-max'
pengine[3485]: 2007/06/18_20:49:46 notice: cluster_option: Using default value '-1' for cluster option 'pe-warn-series-max'
pengine[3485]: 2007/06/18_20:49:46 notice: cluster_option: Using default value '-1' for cluster option 'pe-input-series-max'
pengine[3485]: 2007/06/18_20:49:46 notice: cluster_option: Using default value 'true' for cluster option 'startup-fencing'
pengine[3485]: 2007/06/18_20:49:46 WARN: cluster_status: We do not have quorum - fencing and resource management disabled
pengine[3485]: 2007/06/18_20:49:46 info: determine_online_status: Node sntest2.dclux.loc is online
pengine[3485]: 2007/06/18_20:49:46 info: native_print: IPaddr_192_168_1_168 (heartbeat::ocf:IPaddr): Started sntest2.dclux.loc
pengine[3485]: 2007/06/18_20:49:46 notice: StopRsc: sntest2.dclux.loc Stop IPaddr_192_168_1_168
crmd[3472]: 2007/06/18_20:49:46 info: do_state_transition: sntest2.dclux.loc: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
tengine[3484]: 2007/06/18_20:49:46 info: unpack_graph: Unpacked transition 6: 1 actions in 1 synapses
tengine[3484]: 2007/06/18_20:49:46 info: send_rsc_command: Initiating action 4: IPaddr_192_168_1_168_stop_0 on sntest2.dclux.loc
crmd[3472]: 2007/06/18_20:49:46 info: do_lrm_rsc_op: Performing op=IPaddr_192_168_1_168_stop_0 key=4:6:be5645bb-9fc6-4331-9bce-6cad3a7d88f1)
pengine[3485]: 2007/06/18_20:49:46 info: process_pe_message: Transition 6: PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-150.bz2
crmd[3472]: 2007/06/18_20:49:46 WARN: process_lrm_event: LRM operation IPaddr_192_168_1_168_monitor_5000 (call=8, rc=-2) Cancelled
lrmd[3469]: 2007/06/18_20:49:46 info: RA output: (IPaddr_192_168_1_168:stop:stderr) SIOCDELRT: No such process
IPaddr[3780]: 2007/06/18_20:49:46 INFO: /sbin/ifconfig eth0:0 192.168.1.168 down
crmd[3472]: 2007/06/18_20:49:46 info: process_lrm_event: LRM operation IPaddr_192_168_1_168_stop_0 (call=10, rc=0) complete
cib[3468]: 2007/06/18_20:49:46 info: cib_diff_notify: Update (client:3472, call:41): 0.1.26 -> 0.1.27 (ok)
tengine[3484]: 2007/06/18_20:49:46 info: te_update_diff: Processing diff (cib_update): 0.1.26 -> 0.1.27
tengine[3484]: 2007/06/18_20:49:46 info: match_graph_event: Action IPaddr_192_168_1_168_stop_0 (4) confirmed on db1c3a49-0f90-424a-8f09-c85af7ed83a8
tengine[3484]: 2007/06/18_20:49:46 info: run_graph: Transition 6: (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0)
tengine[3484]: 2007/06/18_20:49:46 info: notify_crmd: Transition 6 status: te_complete - <null>
crmd[3472]: 2007/06/18_20:49:46 info: do_state_transition: sntest2.dclux.loc: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
cib[3800]: 2007/06/18_20:49:46 info: write_cib_contents: Wrote version 0.1.27 of the CIB to disk (digest: d622380a2238356b807686a901cd1b6c)
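
Since the log is long, here is a minimal Python sketch (not a heartbeat tool) that keeps only the entries mentioning a resource start, a resource stop, or quorum. The sample entries are copied from the log above; the name key_events and the keyword list are my own assumptions about what is worth looking at.

```python
import re

# A few entries copied from the log above, one per line.
SAMPLE_LOG = """
heartbeat[3457]: 2007/06/18_20:49:44 WARN: node sntest1.dclux.loc: is dead
pengine[3485]: 2007/06/18_20:49:44 notice: StartRsc: sntest2.dclux.loc Start IPaddr_192_168_1_168
crmd[3472]: 2007/06/18_20:49:46 info: crmd_ccm_msg_callback: Quorum lost after event=INVALID (id=3)
pengine[3485]: 2007/06/18_20:49:46 WARN: cluster_status: We do not have quorum - fencing and resource management disabled
pengine[3485]: 2007/06/18_20:49:46 notice: StopRsc: sntest2.dclux.loc Stop IPaddr_192_168_1_168
"""

# daemon[pid]: timestamp level: message
ENTRY = re.compile(r"^(\w+)\[(\d+)\]: (\S+) (\w+): (.*)$")

# Hypothetical keyword list: substrings that mark interesting entries.
KEYWORDS = ("startrsc", "stoprsc", "quorum")


def key_events(lines):
    """Return (timestamp, daemon, message) for entries matching KEYWORDS."""
    events = []
    for line in lines:
        match = ENTRY.match(line.strip())
        if match is None:
            continue  # blank or unparsable line
        daemon, _pid, timestamp, _level, message = match.groups()
        if any(keyword in message.lower() for keyword in KEYWORDS):
            events.append((timestamp, daemon, message))
    return events


if __name__ == "__main__":
    for timestamp, daemon, message in key_events(SAMPLE_LOG.splitlines()):
        print(timestamp, daemon, message)
```

On the sample it keeps four of the five entries: the start at 20:49:44, then the two quorum messages and the stop at 20:49:46.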


and then ... nothing.
sntest2 doesn't get the resource.

What am I missing ?

Thanks in advance,

Benjamin
