[Linux-HA] Ipaddr resource unwanted stop after one node dies

Benjamin KRAFT Wed, 20 Jun 2007 00:32:36 -0700

Hello everybody,



The main objective of my tests was to be able to move resources from one 
node to another easily; I did this successfully with crm_resource.



The main problem I'm experiencing is that when I kill the first node, which 
holds the IPaddr, node 2 gets the IP but releases it immediately.



The configuration :



node 1 : sntest1.dclux.loc

* eth0 : 192.168.1.166/24

* eth1 : 192.168.176.166/24



node 2 : sntest2.dclux.loc

* eth0 : 192.168.1.167/24

* eth1 : 192.168.176.167/24



Resource : 192.168.1.168/24 as an alias of eth0.





/etc/ha.d/ha.cf @sntest1 :

logfacility local0

logfile /var/log/ha-log

debugfile /var/log/ha-debug

keepalive 100ms

deadtime 250ms

warntime 500ms

initdead 120 # depend on your hardware

udpport 694

ping 192.168.176.167

bcast eth1

auto_failback off

node sntest1.dclux.loc

node sntest2.dclux.loc

#use_logd yes

compression bz2

compression_threshold 2

crm yes



/etc/ha.d/ha.cf @sntest2 :

logfacility local0

logfile /var/log/ha-log

debugfile /var/log/ha-debug

keepalive 100ms

deadtime 250ms

warntime 500ms

initdead 120 # depend on your hardware

udpport 694

ping 192.168.176.166

bcast eth1

auto_failback off

node sntest1.dclux.loc

node sntest2.dclux.loc

#use_logd yes

compression bz2

compression_threshold 2

crm yes



/var/lib/heartbeat/cib.xml :

<cib admin_epoch="0" node_fencing="yes" have_quorum="false" ignore_dtd="false" 
num_peers="2" ccm_transition="3" generated="true" 
dc_uuid="a3efd7cd-eca4-4ea5-9bba-417225edb077" epoch="2" 
cib_feature_revision="1.3" num_updates="30" cib-last-written="Mon Jun 18 
20:49:43 2007">

<configuration>

<crm_config>

<cluster_property_set id="cib-bootstrap-options">

<attributes>

<nvpair id="cib-bootstrap-options-symmetric-cluster" name="symmetric-cluster" 
value="true"/>

<nvpair id="cib-bootstrap-options-no_quorum-policy" name="no_quorum-policy" 
value="stop"/>

<nvpair id="cib-bootstrap-options-default-resource-stickiness" 
name="default-resource-stickiness" value="0"/>

<nvpair id="cib-bootstrap-options-default-resource-failure-stickiness" 
name="default-resource-failure-stickiness" value="0"/>

<nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" 
value="false"/>

<nvpair id="cib-bootstrap-options-stonith-action" name="stonith-action" 
value="reboot"/>

<nvpair id="cib-bootstrap-options-stop-orphan-resources" 
name="stop-orphan-resources" value="true"/>

<nvpair id="cib-bootstrap-options-stop-orphan-actions" 
name="stop-orphan-actions" value="true"/>

<nvpair id="cib-bootstrap-options-remove-after-stop" name="remove-after-stop" 
value="false"/>

<nvpair id="cib-bootstrap-options-short-resource-names" 
name="short-resource-names" value="true"/>

<nvpair id="cib-bootstrap-options-transition-idle-timeout" 
name="transition-idle-timeout" value="5min"/>

<nvpair id="cib-bootstrap-options-default-action-timeout" 
name="default-action-timeout" value="5s"/>

<nvpair id="cib-bootstrap-options-is-managed-default" name="is-managed-default" 
value="true"/>

</attributes>

</cluster_property_set>

</crm_config>

<nodes>

<node id="db1c3a49-0f90-424a-8f09-c85af7ed83a8" uname="sntest2.dclux.loc" 
type="normal"/>

<node id="a3efd7cd-eca4-4ea5-9bba-417225edb077" uname="sntest1.dclux.loc" 
type="normal"/>

</nodes>

<resources>

<primitive class="ocf" id="IPaddr_192_168_1_168" provider="heartbeat" 
type="IPaddr" failstop_type="stonith">

<operations>

<op id="IPaddr_192_168_1_168_mon" interval="5s" name="monitor" timeout="5s"/>

</operations>

<instance_attributes id="IPaddr_192_168_1_168_inst_attr">

<attributes>

<nvpair id="IPaddr_192_168_1_168_attr_0" name="ip" value="192.168.1.168"/>

<nvpair id="IPaddr_192_168_1_168_attr_1" name="netmask" value="24"/>

<nvpair id="IPaddr_192_168_1_168_attr_2" name="nic" value="eth0"/>

</attributes>

</instance_attributes>

</primitive>

</resources>

<constraints>

<rsc_location id="rsc_location_IPaddr_192_168_1_168" rsc="IPaddr_192_168_1_168">

<rule id="prefered_location_IPaddr_192_168_1_168" score="100">

<expression attribute="#uname" id="prefered_location_IPaddr_192_168_1_168_expr" 
operation="eq" value="sntest1.dclux.loc"/>

</rule>

</rsc_location>

<rsc_location rsc="IPaddr_192_168_1_168" id="cli-prefer-IPaddr_192_168_1_168">

<rule score="INFINITY" id="cli-prefer-rule-IPaddr_192_168_1_168">

<expression attribute="#uname" operation="eq" type="string" 
id="cli-prefer-expr-IPaddr_192_168_1_168" value="sntest1.dclux.loc"/>

</rule>

</rsc_location>

</constraints>

</configuration>

</cib>





**** Time frame of the test ****



* heartbeat running, resource IPaddr_192_168_1_168 is on sntest1.

* I run "ifconfig eth1 down" by hand on sntest1



Here are the logs on sntest2 after this:



heartbeat[3457]: 2007/06/18_20:49:44 WARN: node 192.168.176.166: is dead

heartbeat[3457]: 2007/06/18_20:49:44 WARN: node sntest1.dclux.loc: is dead

crmd[3472]: 2007/06/18_20:49:44 notice: crmd_ha_status_callback: Status update: 
Node 192.168.176.166 now has status [dead]

heartbeat[3457]: 2007/06/18_20:49:44 info: Link 192.168.176.166:192.168.176.166 
dead.

heartbeat[3457]: 2007/06/18_20:49:44 info: Link sntest1.dclux.loc:eth1 dead.

crmd[3472]: 2007/06/18_20:49:44 WARN: get_uuid: Could not calculate UUID for 
192.168.176.166

crmd[3472]: 2007/06/18_20:49:44 info: crmd_ha_status_callback: Ping node 
192.168.176.166 is dead

crmd[3472]: 2007/06/18_20:49:44 notice: crmd_ha_status_callback: Status update: 
Node sntest1.dclux.loc now has status [dead]

cib[3468]: 2007/06/18_20:49:44 info: cib_diff_notify: Local-only Change 
(client:3472, call: 34): 0.1.24 (ok)

cib[3693]: 2007/06/18_20:49:44 info: write_cib_contents: Wrote version 0.1.24 
of the CIB to disk (digest: 4e309facbe71d928e0ec8b6e52a151f0)

tengine[3484]: 2007/06/18_20:49:44 info: te_update_diff: Processing diff 
(cib_update): 0.1.24 -> 0.1.24

tengine[3484]: 2007/06/18_20:49:44 WARN: match_down_event: No match for 
shutdown action on a3efd7cd-eca4-4ea5-9bba-417225edb077

tengine[3484]: 2007/06/18_20:49:44 info: extract_event: Stonith/shutdown of 
a3efd7cd-eca4-4ea5-9bba-417225edb077 not matched

tengine[3484]: 2007/06/18_20:49:44 info: update_abort_priority: Abort priority 
upgraded to 1000000

tengine[3484]: 2007/06/18_20:49:44 info: te_update_diff: Aborting on 
transient_attributes deletions

crmd[3472]: 2007/06/18_20:49:44 info: do_state_transition: sntest2.dclux.loc: 
State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC 
cause=C_IPC_MESSAGE origin=route_message ]

crmd[3472]: 2007/06/18_20:49:44 info: do_state_transition: All 2 cluster nodes 
are eligable to run resources.

pengine[3485]: 2007/06/18_20:49:44 info: log_data_element: process_pe_message: 
[generation] <cib admin_epoch="0" epoch="1" num_updates="24" node_fencing="yes" 
generated="true" have_quorum="true" ignore_dtd="false" num_peers="2" 
ccm_transition="2" cib_feature_revision="1.3" 
dc_uuid="db1c3a49-0f90-424a-8f09-c85af7ed83a8"/>

pengine[3485]: 2007/06/18_20:49:44 notice: cluster_option: Using default value 
'stop' for cluster option 'no-quorum-policy'

pengine[3485]: 2007/06/18_20:49:44 notice: cluster_option: Using default value 
'60s' for cluster option 'cluster-delay'

pengine[3485]: 2007/06/18_20:49:44 notice: cluster_option: Using default value 
'-1' for cluster option 'pe-error-series-max'

pengine[3485]: 2007/06/18_20:49:44 notice: cluster_option: Using default value 
'-1' for cluster option 'pe-warn-series-max'

pengine[3485]: 2007/06/18_20:49:44 notice: cluster_option: Using default value 
'-1' for cluster option 'pe-input-series-max'

pengine[3485]: 2007/06/18_20:49:44 notice: cluster_option: Using default value 
'true' for cluster option 'startup-fencing'

pengine[3485]: 2007/06/18_20:49:44 info: determine_online_status: Node 
sntest2.dclux.loc is online

pengine[3485]: 2007/06/18_20:49:44 info: native_print: IPaddr_192_168_1_168 
(heartbeat::ocf:IPaddr): Stopped

pengine[3485]: 2007/06/18_20:49:44 notice: StartRsc: sntest2.dclux.loc Start 
IPaddr_192_168_1_168

pengine[3485]: 2007/06/18_20:49:44 notice: Recurring: sntest2.dclux.loc 
IPaddr_192_168_1_168_monitor_5000

crmd[3472]: 2007/06/18_20:49:44 info: do_state_transition: sntest2.dclux.loc: 
State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=route_message ]

tengine[3484]: 2007/06/18_20:49:44 info: unpack_graph: Unpacked transition 5: 2 
actions in 2 synapses

tengine[3484]: 2007/06/18_20:49:44 info: send_rsc_command: Initiating action 3: 
IPaddr_192_168_1_168_start_0 on sntest2.dclux.loc

crmd[3472]: 2007/06/18_20:49:44 info: do_lrm_rsc_op: Performing 
op=IPaddr_192_168_1_168_start_0 key=3:5:be5645bb-9fc6-4331-9bce-6cad3a7d88f1)

pengine[3485]: 2007/06/18_20:49:44 info: process_pe_message: Transition 5: 
PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-149.bz2

IPaddr[3694]: 2007/06/18_20:49:44 INFO: Using calculated netmask for 
192.168.1.168: 255.255.255.0

IPaddr[3694]: 2007/06/18_20:49:44 DEBUG: Using calculated broadcast for 
192.168.1.168: 192.168.1.255

IPaddr[3694]: 2007/06/18_20:49:44 INFO: eval /sbin/ifconfig eth0:0 
192.168.1.168 netmask 255.255.255.0 broadcast 192.168.1.255

IPaddr[3694]: 2007/06/18_20:49:44 DEBUG: Sending Gratuitous Arp for 
192.168.1.168 on eth0:0 [eth0]

crmd[3472]: 2007/06/18_20:49:44 info: process_lrm_event: LRM operation 
IPaddr_192_168_1_168_start_0 (call=7, rc=0) complete

crmd[3472]: 2007/06/18_20:49:44 info: append_restart_list: Resource 
IPaddr_192_168_1_168 does not support reloads

cib[3468]: 2007/06/18_20:49:44 info: cib_diff_notify: Update (client: 3472, 
call:37): 0.1.24 -> 0.1.25 (ok)

tengine[3484]: 2007/06/18_20:49:44 info: te_update_diff: Processing diff 
(cib_update): 0.1.24 -> 0.1.25

tengine[3484]: 2007/06/18_20:49:44 info: match_graph_event: Action 
IPaddr_192_168_1_168_start_0 (3) confirmed on 
db1c3a49-0f90-424a-8f09-c85af7ed83a8

tengine[3484]: 2007/06/18_20:49:44 info: send_rsc_command: Initiating action 4: 
IPaddr_192_168_1_168_monitor_5000 on sntest2.dclux.loc

crmd[3472]: 2007/06/18_20:49:44 info: do_lrm_rsc_op: Performing 
op=IPaddr_192_168_1_168_monitor_5000 
key=4:5:be5645bb-9fc6-4331-9bce-6cad3a7d88f1)

cib[3763]: 2007/06/18_20:49:44 info: write_cib_contents: Wrote version 0.1.25 
of the CIB to disk (digest: c564f9c3ff4c6ebfa2da07a1487e3195)

crmd[3472]: 2007/06/18_20:49:44 info: process_lrm_event: LRM operation 
IPaddr_192_168_1_168_monitor_5000 (call=8, rc=0) complete

cib[3468]: 2007/06/18_20:49:44 info: cib_diff_notify: Update (client: 3472, 
call:38): 0.1.25 -> 0.1.26 (ok)

tengine[3484]: 2007/06/18_20:49:44 info: te_update_diff: Processing diff 
(cib_update): 0.1.25 -> 0.1.26

tengine[3484]: 2007/06/18_20:49:44 info: match_graph_event: Action 
IPaddr_192_168_1_168_monitor_5000 (4) confirmed on 
db1c3a49-0f90-424a-8f09-c85af7ed83a8

tengine[3484]: 2007/06/18_20:49:44 info: run_graph: Transition 5: (Complete=2, 
Pending=0, Fired=0, Skipped=0, Incomplete=0)

tengine[3484]: 2007/06/18_20:49:44 info: notify_crmd: Transition 5 status: 
te_complete - <null>

crmd[3472]: 2007/06/18_20:49:44 info: do_state_transition: sntest2.dclux.loc: 
State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_IPC_MESSAGE origin=route_message ]

cib[3778]: 2007/06/18_20:49:44 info: write_cib_contents: Wrote version 0.1.26 
of the CIB to disk (digest: b4d9fdb02151cfe83f146108fab56b8c)

cib[3468]: 2007/06/18_20:49:46 info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm

crmd[3472]: 2007/06/18_20:49:46 info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm

cib[3468]: 2007/06/18_20:49:46 info: mem_handle_event: no mbr_track info

cib[3468]: 2007/06/18_20:49:46 info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm

crmd[3472]: 2007/06/18_20:49:46 info: mem_handle_event: no mbr_track info

cib[3468]: 2007/06/18_20:49:46 info: mem_handle_event: instance=3, nodes=1, 
new=0, lost=1, n_idx=0, new_idx=1, old_idx=4

cib[3468]: 2007/06/18_20:49:46 info: cib_ccm_msg_callback: LOST: 
sntest1.dclux.loc

cib[3468]: 2007/06/18_20:49:46 info: cib_ccm_msg_callback: PEER: 
sntest2.dclux.loc

crmd[3472]: 2007/06/18_20:49:46 info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm

crmd[3472]: 2007/06/18_20:49:46 info: mem_handle_event: instance=3, nodes=1, 
new=0, lost=1, n_idx=0, new_idx=1, old_idx=4

crmd[3472]: 2007/06/18_20:49:46 info: crmd_ccm_msg_callback: Quorum lost after 
event=INVALID (id=3)

crmd[3472]: 2007/06/18_20:49:46 info: crmd_ccm_msg_callback: Quorum lost: 
triggering transition (INVALID)

crmd[3472]: 2007/06/18_20:49:46 info: ccm_event_detail: INVALID: trans=3, 
nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=4

crmd[3472]: 2007/06/18_20:49:46 info: ccm_event_detail: CURRENT: 
sntest2.dclux.loc [nodeid=2, born=3]

cib[3468]: 2007/06/18_20:49:46 info: cib_diff_notify: Local-only Change 
(client:3472, call: 39): 0.1.26 (ok)

crmd[3472]: 2007/06/18_20:49:46 info: ccm_event_detail: LOST: sntest1.dclux.loc 
[nodeid=1, born=2]

tengine[3484]: 2007/06/18_20:49:46 info: update_abort_priority: Abort priority 
upgraded to 1000000

tengine[3484]: 2007/06/18_20:49:46 info: te_update_diff: Processing diff 
(cib_update): 0.1.26 -> 0.1.26

crmd[3472]: 2007/06/18_20:49:46 info: do_state_transition: sntest2.dclux.loc: 
State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC 
cause=C_IPC_MESSAGE origin=route_message ]

crmd[3472]: 2007/06/18_20:49:46 info: do_state_transition: All 1 cluster nodes 
are eligable to run resources.

cib[3779]: 2007/06/18_20:49:46 info: write_cib_contents: Wrote version 0.1.26 
of the CIB to disk (digest: bca029655a297d98a51ac667259e2a29)

pengine[3485]: 2007/06/18_20:49:46 info: log_data_element: process_pe_message: 
[generation] <cib admin_epoch="0" epoch="1" num_updates="26" node_fencing="yes" 
generated="true" have_quorum="false" ignore_dtd="false" num_peers="2" 
ccm_transition="3" cib_feature_revision="1.3" 
dc_uuid="db1c3a49-0f90-424a-8f09-c85af7ed83a8"/>

pengine[3485]: 2007/06/18_20:49:46 notice: cluster_option: Using default value 
'stop' for cluster option 'no-quorum-policy'

pengine[3485]: 2007/06/18_20:49:46 notice: cluster_option: Using default value 
'60s' for cluster option 'cluster-delay'

pengine[3485]: 2007/06/18_20:49:46 notice: cluster_option: Using default value 
'-1' for cluster option 'pe-error-series-max'

pengine[3485]: 2007/06/18_20:49:46 notice: cluster_option: Using default value 
'-1' for cluster option 'pe-warn-series-max'

pengine[3485]: 2007/06/18_20:49:46 notice: cluster_option: Using default value 
'-1' for cluster option 'pe-input-series-max'

pengine[3485]: 2007/06/18_20:49:46 notice: cluster_option: Using default value 
'true' for cluster option 'startup-fencing'

pengine[3485]: 2007/06/18_20:49:46 WARN: cluster_status: We do not have quorum 
- fencing and resource management disabled

pengine[3485]: 2007/06/18_20:49:46 info: determine_online_status: Node 
sntest2.dclux.loc is online

pengine[3485]: 2007/06/18_20:49:46 info: native_print: IPaddr_192_168_1_168 
(heartbeat::ocf:IPaddr): Started sntest2.dclux.loc

pengine[3485]: 2007/06/18_20:49:46 notice: StopRsc: sntest2.dclux.loc Stop 
IPaddr_192_168_1_168

crmd[3472]: 2007/06/18_20:49:46 info: do_state_transition: sntest2.dclux.loc: 
State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=route_message ]

tengine[3484]: 2007/06/18_20:49:46 info: unpack_graph: Unpacked transition 6: 1 
actions in 1 synapses

tengine[3484]: 2007/06/18_20:49:46 info: send_rsc_command: Initiating action 4: 
IPaddr_192_168_1_168_stop_0 on sntest2.dclux.loc

crmd[3472]: 2007/06/18_20:49:46 info: do_lrm_rsc_op: Performing 
op=IPaddr_192_168_1_168_stop_0 key=4:6:be5645bb-9fc6-4331-9bce-6cad3a7d88f1)

pengine[3485]: 2007/06/18_20:49:46 info: process_pe_message: Transition 6: 
PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-150.bz2

crmd[3472]: 2007/06/18_20:49:46 WARN: process_lrm_event: LRM operation 
IPaddr_192_168_1_168_monitor_5000 (call=8, rc=-2) Cancelled

lrmd[3469]: 2007/06/18_20:49:46 info: RA output: 
(IPaddr_192_168_1_168:stop:stderr) SIOCDELRT: No such process

IPaddr[3780]: 2007/06/18_20:49:46 INFO: /sbin/ifconfig eth0:0 192.168.1.168 down

crmd[3472]: 2007/06/18_20:49:46 info: process_lrm_event: LRM operation 
IPaddr_192_168_1_168_stop_0 (call=10, rc=0) complete

cib[3468]: 2007/06/18_20:49:46 info: cib_diff_notify: Update (client: 3472, 
call:41): 0.1.26 -> 0.1.27 (ok)

tengine[3484]: 2007/06/18_20:49:46 info: te_update_diff: Processing diff 
(cib_update): 0.1.26 -> 0.1.27

tengine[3484]: 2007/06/18_20:49:46 info: match_graph_event: Action 
IPaddr_192_168_1_168_stop_0 (4) confirmed on 
db1c3a49-0f90-424a-8f09-c85af7ed83a8

tengine[3484]: 2007/06/18_20:49:46 info: run_graph: Transition 6: (Complete=1, 
Pending=0, Fired=0, Skipped=0, Incomplete=0)

tengine[3484]: 2007/06/18_20:49:46 info: notify_crmd: Transition 6 status: 
te_complete - <null>

crmd[3472]: 2007/06/18_20:49:46 info: do_state_transition: sntest2.dclux.loc: 
State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_IPC_MESSAGE origin=route_message ]

cib[3800]: 2007/06/18_20:49:46 info: write_cib_contents: Wrote version 0.1.27 
of the CIB to disk (digest: d622380a2238356b807686a901cd1b6c)



After this, the resource is not running on either node.



What am I missing or doing wrong?



Thanks in advance,



Benjamin


