RE: [Linux-HA] Fencing prevents resource from failing over

abhishek.bagchi Mon, 26 Nov 2007 05:39:22 -0800

Hi Andrew,
I just modified my stonith device to work in both online and offline
mode. The stonith operation (standby -> active) now succeeds with the
active node's cable unplugged, and the standby node seems to try to
start the resource, but fails. The log is attached, but there isn't
enough in it to find out what's going on. It just prints:


pengine[15900]: 2007/11/26_18:11:27 WARN: unpack_rsc_op: Processing
failed op (Proxy_10_114_31_238_start_0) for Proxy_10_114_31_238 on
standby
pengine[15900]: 2007/11/26_18:11:27 WARN: unpack_rsc_op: Handling failed
start for Proxy_10_114_31_238 on standby 

Is there a way to enable more log messages in HA at run-time? The debug
log and the regular log seem to contain the same number of messages.

Thanks again,
Abhi

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Andrew Beekhof
Sent: Monday, November 26, 2007 2:59 PM
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] Fencing prevents resource from failing over


On Nov 26, 2007, at 9:56 AM, <[EMAIL PROTECTED]>
<[EMAIL PROTECTED]  > wrote:

>
> Thanks Andrew,
> My comments are inline...
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Andrew 
> Beekhof
> Sent: Monday, November 26, 2007 1:44 PM
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] Fencing prevents resource from failing over
>
>
> On Nov 26, 2007, at 6:25 AM, <[EMAIL PROTECTED]> 
> <[EMAIL PROTECTED]  > wrote:
>
>>
>> Hi,
>> I've a 2 node active/passive cluster (active node=>active, passive
>> node=>standby) using heartbeat 2.0.8. I recently enabled stonith.
>> The stonith device is an rsh device that tries to restart the cluster
>> node. However, something that used to work with stonith disabled has
>> stopped working now: node failover on network cable disconnection. I
>> believe since the stonith device uses the network, the stonith fails
>> and hence the resource is left wherever it was running.
>
> correct.  the cluster will not start anything until it can verify the
> node is truly dead (with a successful stonith operation). this is how a
> stonith enabled cluster is supposed to work and is why IP-based
> stonith modules are not a great idea.
>
>
>
>> Can anyone please help resolve this problem (this is probably not a
>> problem and this is how stonith is expected to work)? I would like
>> to know if there's any way to tell the passive (currently active node)
>> to give up trying to stonith and then start the resource.
>
> by design - no.
>
>> I've attached my
>> cib file and logs from the passive when the cable is disconnected.
>> I've no problem with both nodes running the resource, as active is
>> cut off from the network anyway and can't do any damage.
>
> if that's truly the case, then you may not need stonith.
>
> ABHI: But, if the Active comes online again it's a very bad thing for 
> both nodes to be running the resources.

the crm will detect that and stop one of them.
however, there will always be a period of time (even with your proposal
below) where they are both active and both connected to the network.

> Can we configure two stonith
> devices and make the node think stonith is successful if either of the
> stonith operations returns success? Is there some kind of resource
> constraint that I can use in this case?
> 1. Online stonith device: uses IP to reset the other node.
> 2. Offline stonith device: just a dummy that always returns success
> on reset.

if you're lucky, this might work 9 times out of 10.
but it's likely that when it doesn't work, it's going to _really_
hurt you.

"tricking" the cluster almost always leads to pain.


my advice... get a real stonith device...
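
To picture the rule discussed above, here is a toy model in Python. It is
only a sketch under stated assumptions, not how heartbeat or the CRM is
implemented: the names fence, may_start_resources, ip_reset and
dummy_reset are all hypothetical, and a fencing device is modelled as a
plain function that returns True when it reports a successful reset.

```python
from typing import Callable, Iterable

# Hypothetical model: a fencing device is a function that takes a node
# name and returns True if it reports that the node was reset.
FenceDevice = Callable[[str], bool]


def fence(node: str, devices: Iterable[FenceDevice]) -> bool:
    """Report success as soon as any one device reports success."""
    return any(device(node) for device in devices)


def may_start_resources(peer_unclean: bool, peer_fenced: bool) -> bool:
    """The survivor may recover resources only if the peer is known to be
    clean or has been fenced successfully."""
    return (not peer_unclean) or peer_fenced


def ip_reset(node: str) -> bool:
    """IP-based device with the peer's cable unplugged: cannot reach it."""
    return False


def dummy_reset(node: str) -> bool:
    """Dummy device: always claims success without touching the node."""
    return True


if __name__ == "__main__":
    # Only the IP-based device: fencing fails, so nothing is started.
    print(may_start_resources(True, fence("active", [ip_reset])))
    # IP-based device plus dummy: success is reported even though the
    # peer was never actually reset.
    print(may_start_resources(True, fence("active", [ip_reset, dummy_reset])))
```

With only the IP-based device, the survivor never gets permission to
start anything once the cable is pulled. Adding the dummy device grants
permission in every case, including the one where the peer is still
alive and comes back onto the network.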

>> The standby log seems to
>> say it has quorum
>
> 2-node clusters always have quorum, so the value is meaningless...
>
>> but it makes me wonder why it doesn't start the resources, in spite of
>> the following being evident from the logs.
>>
>> 1. Standby marks active unclean
>> 2. Standby has quorum
>> 3. Standby tries to move resources back to standby
>>
>>
>> Thanks in advance,
>> Abhi.
>>
>> The information contained in this electronic message and any
>> attachments to this message are intended for the exclusive use of the
>> addressee(s) and may contain proprietary, confidential or privileged
>> information. If you are not the intended recipient, you should not
>> disseminate, distribute or copy this e-mail. Please notify the sender
>> immediately and destroy all copies of this message and any
>> attachments.
>>
>> WARNING: Computer viruses can be transmitted via email. The recipient
>> should check this email and any attachments for the presence of
>> viruses. The company accepts no liability for any damage caused by
>> any virus transmitted by this email.
>>
>> www.wipro.com
>> <ha-log-standby.txt><cib.xml>
>> _______________________________________________
>> Linux-HA mailing list
>> [email protected]
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems

heartbeat[15399]: 2007/11/26_18:11:23 WARN: node active: is dead
heartbeat[15399]: 2007/11/26_18:11:23 info: Link active:eth0 dead.
crmd[15413]: 2007/11/26_18:11:23 notice: crmd_ha_status_callback: Status 
update: Node active now has status [dead]
ccm[15408]: 2007/11/26_18:11:23 info: Break tie for 2 nodes cluster
cib[15409]: 2007/11/26_18:11:23 info: cib_diff_notify: Local-only Change 
(client:15413, call: 98): 0.590.7031 (ok)
crmd[15413]: 2007/11/26_18:11:23 info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
crmd[15413]: 2007/11/26_18:11:23 info: mem_handle_event: no mbr_track info
crmd[15413]: 2007/11/26_18:11:23 info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
crmd[15413]: 2007/11/26_18:11:23 info: mem_handle_event: instance=3, nodes=1, 
new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
cib[15409]: 2007/11/26_18:11:23 info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
tengine[15899]: 2007/11/26_18:11:23 info: te_update_diff: Processing diff 
(cib_update): 0.590.7031 -> 0.590.7031
crmd[15413]: 2007/11/26_18:11:23 info: crmd_ccm_msg_callback: Quorum 
(re)attained after event=NEW MEMBERSHIP (id=3)
cib[15409]: 2007/11/26_18:11:23 info: mem_handle_event: no mbr_track info
tengine[15899]: 2007/11/26_18:11:23 WARN: match_down_event: No match for 
shutdown action on 6ef6bc8d-de62-49aa-8ed3-e4fa300cff8c
crmd[15413]: 2007/11/26_18:11:23 info: ccm_event_detail: NEW MEMBERSHIP: 
trans=3, nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=3
cib[15409]: 2007/11/26_18:11:23 info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
tengine[15899]: 2007/11/26_18:11:23 info: extract_event: Stonith/shutdown of 
6ef6bc8d-de62-49aa-8ed3-e4fa300cff8c not matched
crmd[15413]: 2007/11/26_18:11:23 info: ccm_event_detail: CURRENT: standby 
[nodeid=1, born=3]
cib[15409]: 2007/11/26_18:11:23 info: mem_handle_event: instance=3, nodes=1, 
new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
tengine[15899]: 2007/11/26_18:11:24 info: update_abort_priority: Abort priority 
upgraded to 1000000
crmd[15413]: 2007/11/26_18:11:24 info: ccm_event_detail: LOST:    active 
[nodeid=0, born=2]
cib[15409]: 2007/11/26_18:11:24 info: cib_ccm_msg_callback: LOST: active
tengine[15899]: 2007/11/26_18:11:24 info: te_update_diff: Aborting on 
transient_attributes deletions
crmd[15413]: 2007/11/26_18:11:24 info: do_state_transition: standby: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE 
origin=route_message ]
cib[15409]: 2007/11/26_18:11:24 info: cib_ccm_msg_callback: PEER: standby
crmd[15413]: 2007/11/26_18:11:24 info: do_state_transition: All 1 cluster nodes 
are eligable to run resources.
cib[15409]: 2007/11/26_18:11:24 info: cib_diff_notify: Local-only Change 
(client:15413, call: 99): 0.590.7031 (ok)
tengine[15899]: 2007/11/26_18:11:25 info: te_update_diff: Processing diff 
(cib_update): 0.590.7031 -> 0.590.7031
cib[18146]: 2007/11/26_18:11:25 info: write_cib_contents: Wrote version 
0.590.7031 of the CIB to disk (digest: e99b819c391bf34d8e8506ee20976e61)
pengine[15900]: 2007/11/26_18:11:25 info: log_data_element: process_pe_message: 
[generation] <cib admin_epoch="0" have_quorum="true" ignore_dtd="false" 
num_peers="2" cib_feature_revision="1.3" generated="true" ccm_transition="3" 
dc_uuid="cfd38e2f-2e94-4c49-9068-3aead25c9476" epoch="590" num_updates="7031"/>
pengine[15900]: 2007/11/26_18:11:25 notice: cluster_option: Using default value 
'true' for cluster option 'symmetric-cluster'
pengine[15900]: 2007/11/26_18:11:25 notice: cluster_option: Using default value 
'reboot' for cluster option 'stonith-action'
pengine[15900]: 2007/11/26_18:11:25 notice: cluster_option: Using default value 
'0' for cluster option 'default-resource-failure-stickiness'
pengine[15900]: 2007/11/26_18:11:25 notice: cluster_option: Using default value 
'true' for cluster option 'is-managed-default'
pengine[15900]: 2007/11/26_18:11:25 notice: cluster_option: Using default value 
'60s' for cluster option 'cluster-delay'
pengine[15900]: 2007/11/26_18:11:26 notice: cluster_option: Using default value 
'20s' for cluster option 'default-action-timeout'
pengine[15900]: 2007/11/26_18:11:26 notice: cluster_option: Using default value 
'true' for cluster option 'stop-orphan-resources'
pengine[15900]: 2007/11/26_18:11:26 notice: cluster_option: Using default value 
'true' for cluster option 'stop-orphan-actions'
pengine[15900]: 2007/11/26_18:11:26 notice: cluster_option: Using default value 
'false' for cluster option 'remove-after-stop'
pengine[15900]: 2007/11/26_18:11:26 notice: cluster_option: Using default value 
'-1' for cluster option 'pe-error-series-max'
pengine[15900]: 2007/11/26_18:11:26 notice: cluster_option: Using default value 
'-1' for cluster option 'pe-warn-series-max'
pengine[15900]: 2007/11/26_18:11:26 notice: cluster_option: Using default value 
'-1' for cluster option 'pe-input-series-max'
pengine[15900]: 2007/11/26_18:11:26 notice: cluster_option: Using default value 
'true' for cluster option 'startup-fencing'
pengine[15900]: 2007/11/26_18:11:27 notice: unpack_config: On loss of CCM 
Quorum: Ignore
pengine[15900]: 2007/11/26_18:11:27 WARN: determine_online_status_fencing: Node 
active (6ef6bc8d-de62-49aa-8ed3-e4fa300cff8c) is un-expectedly down
pengine[15900]: 2007/11/26_18:11:27 info: determine_online_status_fencing: 
ha_state=dead, ccm_state=false, crm_state=online, join_state=down, 
expected=member
pengine[15900]: 2007/11/26_18:11:27 WARN: determine_online_status: Node active 
is unclean
pengine[15900]: 2007/11/26_18:11:27 info: determine_online_status: Node standby 
is online
pengine[15900]: 2007/11/26_18:11:27 WARN: unpack_rsc_op: Processing failed op 
(Proxy_10_114_31_238_start_0) for Proxy_10_114_31_238 on standby
pengine[15900]: 2007/11/26_18:11:27 WARN: unpack_rsc_op: Handling failed start 
for Proxy_10_114_31_238 on standby
pengine[15900]: 2007/11/26_18:11:27 info: group_print: Resource Group: proxy_rsc
pengine[15900]: 2007/11/26_18:11:27 info: native_print:     
IPaddr_10_114_31_238(heartbeat::ocf:IPaddr):Started active
pengine[15900]: 2007/11/26_18:11:28 info: native_print:     
Proxy_10_114_31_238(heartbeat::ocf:myOcf):Started active
pengine[15900]: 2007/11/26_18:11:28 info: clone_print: Clone Set: stonith_rsc
pengine[15900]: 2007/11/26_18:11:28 info: native_print:     
Stonith:0(stonith:external/rsh):Started standby
pengine[15900]: 2007/11/26_18:11:28 info: native_print:     
Stonith:1(stonith:external/rsh):Started active
pengine[15900]: 2007/11/26_18:11:28 info: native_color: Combine scores from 
Proxy_10_114_31_238 and IPaddr_10_114_31_238
pengine[15900]: 2007/11/26_18:11:28 WARN: native_color: Resource 
IPaddr_10_114_31_238 cannot run anywhere
pengine[15900]: 2007/11/26_18:11:28 WARN: native_color: Resource 
Proxy_10_114_31_238 cannot run anywhere
pengine[15900]: 2007/11/26_18:11:29 notice: StopRsc:   activeStop 
IPaddr_10_114_31_238
pengine[15900]: 2007/11/26_18:11:29 WARN: custom_action: Action 
IPaddr_10_114_31_238_stop_0 on active is unrunnable (offline)
pengine[15900]: 2007/11/26_18:11:29 WARN: custom_action: Marking node active 
unclean
pengine[15900]: 2007/11/26_18:11:29 notice: StopRsc:   activeStop 
Proxy_10_114_31_238
pengine[15900]: 2007/11/26_18:11:29 WARN: custom_action: Action 
Proxy_10_114_31_238_stop_0 on active is unrunnable (offline)
pengine[15900]: 2007/11/26_18:11:29 WARN: custom_action: Marking node active 
unclean
pengine[15900]: 2007/11/26_18:11:29 WARN: native_color: Resource Stonith:1 
cannot run anywhere
pengine[15900]: 2007/11/26_18:11:30 notice: NoRoleChange: Leave resource 
Stonith:0(standby)
pengine[15900]: 2007/11/26_18:11:30 notice: StopRsc:   activeStop Stonith:1
pengine[15900]: 2007/11/26_18:11:30 WARN: custom_action: Action 
Stonith:1_stop_0 on active is unrunnable (offline)
pengine[15900]: 2007/11/26_18:11:30 WARN: custom_action: Marking node active 
unclean
pengine[15900]: 2007/11/26_18:11:30 WARN: stage6: Scheduling Node active for 
STONITH
pengine[15900]: 2007/11/26_18:11:30 WARN: native_stop_constraints: Stop of 
failed resource IPaddr_10_114_31_238 is implict after active is fenced
pengine[15900]: 2007/11/26_18:11:31 info: native_stop_constraints: Re-creating 
actions for proxy_rsc
pengine[15900]: 2007/11/26_18:11:31 notice: StopRsc:   activeStop 
IPaddr_10_114_31_238
pengine[15900]: 2007/11/26_18:11:31 WARN: custom_action: Action 
IPaddr_10_114_31_238_stop_0 on active is unrunnable (offline)
pengine[15900]: 2007/11/26_18:11:31 WARN: custom_action: Marking node active 
unclean
pengine[15900]: 2007/11/26_18:11:31 notice: StopRsc:   activeStop 
Proxy_10_114_31_238
pengine[15900]: 2007/11/26_18:11:31 WARN: custom_action: Action 
Proxy_10_114_31_238_stop_0 on active is unrunnable (offline)
pengine[15900]: 2007/11/26_18:11:31 WARN: custom_action: Marking node active 
unclean
pengine[15900]: 2007/11/26_18:11:31 WARN: native_stop_constraints: Stop of 
failed resource Proxy_10_114_31_238 is implict after active is fenced
pengine[15900]: 2007/11/26_18:11:32 info: native_stop_constraints: Re-creating 
actions for proxy_rsc
pengine[15900]: 2007/11/26_18:11:32 notice: StopRsc:   activeStop 
IPaddr_10_114_31_238
pengine[15900]: 2007/11/26_18:11:32 WARN: custom_action: Action 
IPaddr_10_114_31_238_stop_0 on active is unrunnable (offline)
pengine[15900]: 2007/11/26_18:11:32 WARN: custom_action: Marking node active 
unclean
pengine[15900]: 2007/11/26_18:11:32 notice: StopRsc:   activeStop 
Proxy_10_114_31_238
pengine[15900]: 2007/11/26_18:11:32 WARN: custom_action: Action 
Proxy_10_114_31_238_stop_0 on active is unrunnable (offline)
pengine[15900]: 2007/11/26_18:11:32 WARN: custom_action: Marking node active 
unclean
pengine[15900]: 2007/11/26_18:11:32 WARN: native_stop_constraints: Stop of 
failed resource Stonith:1 is implict after active is fenced
pengine[15900]: 2007/11/26_18:11:33 info: native_stop_constraints: Re-creating 
actions for stonith_rsc
pengine[15900]: 2007/11/26_18:11:33 notice: NoRoleChange: Leave resource 
Stonith:0(standby)
pengine[15900]: 2007/11/26_18:11:33 notice: StopRsc:   activeStop Stonith:1
pengine[15900]: 2007/11/26_18:11:33 WARN: custom_action: Action 
Stonith:1_stop_0 on active is unrunnable (offline)
pengine[15900]: 2007/11/26_18:11:33 WARN: custom_action: Marking node active 
unclean
crmd[15413]: 2007/11/26_18:11:33 info: do_state_transition: standby: State 
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=route_message ]
pengine[15900]: 2007/11/26_18:11:33 WARN: process_pe_message: Transition 13: 
WARNINGs found during PE processing. PEngine Input stored in: 
/var/lib/heartbeat/pengine/pe-warn-9819.bz2
tengine[15899]: 2007/11/26_18:11:33 info: unpack_graph: Unpacked transition 13: 
6 actions in 6 synapses
pengine[15900]: 2007/11/26_18:11:34 info: process_pe_message: Configuration 
WARNINGs found during PE processing.  Please run "crm_verify -L" to identify 
issues.
tengine[15899]: 2007/11/26_18:11:34 info: te_pseudo_action: Pseudo action 5 
fired and confirmed
pengine[15900]: 2007/11/26_18:11:34 ERROR: subsystem_msg_dispatch: pengine took 
9150ms to complete
tengine[15899]: 2007/11/26_18:11:34 info: te_pseudo_action: Pseudo action 15 
fired and confirmed
tengine[15899]: 2007/11/26_18:11:34 info: te_fence_node: Executing reboot 
fencing operation (17) on active (timeout=30000)
stonithd[15411]: 2007/11/26_18:11:34 info: client tengine [pid: 15899] want a 
STONITH operation RESET to node active.
stonithd[15411]: 2007/11/26_18:11:34 info: stonith_operate_locally::2539: 
sending fencing op (1) for active to device external (rsc_id=Stonith:0, 
pid=18153)
tengine[15899]: 2007/11/26_18:11:34 info: te_pseudo_action: Pseudo action 4 
fired and confirmed
tengine[15899]: 2007/11/26_18:11:35 info: te_pseudo_action: Pseudo action 12 
fired and confirmed
tengine[15899]: 2007/11/26_18:11:35 info: te_pseudo_action: Pseudo action 16 
fired and confirmed
stonithd[15411]: 2007/11/26_18:11:37 info: Succeeded to STONITH the node 
active: optype=1. whodoit: standby
tengine[15899]: 2007/11/26_18:11:37 info: tengine_stonith_callback: call=18153, 
optype=1, node_name=active, result=0, node_list=standby, 
action=17:13:14034b9e-6f46-4522-ab91-7124bdbcd307
tengine[15899]: 2007/11/26_18:11:37 info: run_graph: Transition 13: 
(Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0)
tengine[15899]: 2007/11/26_18:11:37 WARN: notify_crmd: Delaying completion 
until all CIB updates complete
cib[15409]: 2007/11/26_18:11:37 info: cib_diff_notify: Update (client: 15899, 
call:7): 0.590.7031 -> 0.590.7032 (ok)
tengine[15899]: 2007/11/26_18:11:37 info: te_update_diff: Processing diff 
(cib_update): 0.590.7031 -> 0.590.7032
tengine[15899]: 2007/11/26_18:11:37 info: notify_crmd: Transition 13 status: 
te_complete - <null>
crmd[15413]: 2007/11/26_18:11:37 info: do_state_transition: standby: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_IPC_MESSAGE origin=route_message ]
cib[18159]: 2007/11/26_18:11:37 info: write_cib_contents: Wrote version 
0.590.7032 of the CIB to disk (digest: 5631ac039c7e700100873b2cff2fc744)
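
When a log like the one above is too noisy to read, a small filter can
help. The following is a minimal sketch in Python, assuming each entry
has the shape "daemon[pid]: YYYY/MM/DD_HH:MM:SS LEVEL: message" as seen
in this log, and that any line not matching that shape is a mail-wrapped
continuation of the previous entry. The function names parse_entries and
filter_entries are made up for illustration.

```python
import re

# Assumed entry shape, inferred from the log above:
#   daemon[pid]: YYYY/MM/DD_HH:MM:SS LEVEL: message
ENTRY = re.compile(
    r"(?P<daemon>\w+)\[(?P<pid>\d+)\]: "
    r"(?P<ts>\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+): (?P<msg>.*)"
)


def parse_entries(lines):
    """Join wrapped continuation lines and return a list of entry dicts."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY.match(line)
        if match:
            entries.append(match.groupdict())
        elif entries and line.strip():
            # Treat a non-matching, non-empty line as the wrapped tail
            # of the previous entry.
            prev = entries[-1]
            prev["msg"] = prev["msg"].rstrip() + " " + line.strip()
    return entries


def filter_entries(entries, levels=("WARN", "ERROR"), needle=None):
    """Keep entries at the given levels, optionally containing needle."""
    return [
        e for e in entries
        if e["level"] in levels and (needle is None or needle in e["msg"])
    ]


if __name__ == "__main__":
    sample = [
        "pengine[15900]: 2007/11/26_18:11:27 WARN: unpack_rsc_op: Processing failed op ",
        "(Proxy_10_114_31_238_start_0) for Proxy_10_114_31_238 on standby",
        "pengine[15900]: 2007/11/26_18:11:27 info: determine_online_status: Node standby ",
        "is online",
    ]
    for entry in filter_entries(parse_entries(sample)):
        print(entry["ts"], entry["daemon"], entry["level"], entry["msg"])
```

Passing needle="Proxy_10_114_31_238" would narrow the output to the WARN
and ERROR entries that mention the proxy resource, which is where the
failed start shows up.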
