Package: cluster-agents
Version: 1:1.0.3-3.1
Severity: important
Tags: patch
Hello,
The VirtualDomain OCF script fails erroneously on stop
(triggering unjustified fencing of the node where it fails).
The reason for this is that the VirtualDomain_Status function
bails out too quickly (inconditionously) when receiving an empty status
back from 'virsh domstate' and when in the context of a 'Stop' action.
It should loop/wait a further few seconds and keep checking the domain
state a few more times before bailing out and triggering a cascade of
errors which eventually prompt the HA layer to fence the node.
(see attached Syslog output for the chain of events)
A patch to fix this behavior is attached.
Conceptually the same fix has actually been applied upstream.
(see https://github.com/ClusterLabs/resource-agents)
I hope this can help.
Regards,
Cedric Dufour
-- System Information:
Debian Release: 6.0.1
APT prefers stable
APT policy: (990, 'stable')
Architecture: amd64 (x86_64)
Kernel: Linux 2.6.32-5-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Versions of packages cluster-agents depends on:
ii cluster-glue 1.0.6-1 The reusable cluster components fo
ii libc6 2.11.2-10 Embedded GNU C Library: Shared lib
ii libcluster-glue 1.0.6-1 The reusable cluster components fo
ii libglib2.0-0 2.24.2-1 The GLib library of C routines
ii libnet1 1.1.4-2 library for the construction and h
ii python 2.6.6-3+squeeze6 interactive high-level object-orie
cluster-agents recommends no packages.
cluster-agents suggests no packages.
-- no debconf information
Jun 8 11:36:55 n501 VirtualDomain[28606]: INFO: Issuing graceful shutdown
request for domain VRNSIDU_intranet.
Jun 8 11:36:55 n501 lrmd: [3232]: info: RA output:
(VRNSIDU_intranet-VirtualDomain:stop:stdout) Domain VRNSIDU_intranet is being
shutdown
Jun 8 11:36:55 n501 sSMTP[28598]: SSL connection using DHE_RSA_AES_128_CBC_SHA1
Jun 8 11:36:55 n501 crmd: [3235]: info: match_graph_event: Action
STONITH-mfsys:0_start_0 (8) confirmed on n502.havc (rc=0)
Jun 8 11:36:55 n501 crmd: [3235]: info: te_rsc_command: Initiating action 9:
monitor STONITH-mfsys:0_monitor_60000 on n502.havc
Jun 8 11:36:55 n501 crmd: [3235]: info: te_pseudo_action: Pseudo action 13
fired and confirmed
Jun 8 11:36:56 n501 crmd: [3235]: info: abort_transition_graph:
te_update_diff:146 - Triggered transition abort (complete=0,
tag=transient_attributes, id=n502.havc, magic=NA, cib=0.33.20) : Transient
attribute: update
Jun 8 11:36:56 n501 crmd: [3235]: info: update_abort_priority: Abort priority
upgraded from 0 to 1000000
Jun 8 11:36:56 n501 crmd: [3235]: info: update_abort_priority: Abort action
done superceeded by restart
Jun 8 11:36:56 n501 sSMTP[28598]: Sent mail for [email protected] (221
2.0.0 Bye) uid=0 username=root outbytes=617
Jun 8 11:36:56 n501 crmd: [3235]: info: process_lrm_event: LRM operation
GW.LUSTRE_intranet-MailTo_start_0 (call=246, rc=0, cib-update=1626,
confirmed=true) ok
Jun 8 11:36:56 n501 crmd: [3235]: info: match_graph_event: Action
GW.LUSTRE_intranet-MailTo_start_0 (22) confirmed on n501.havc (rc=0)
Jun 8 11:36:56 n501 crmd: [3235]: info: match_graph_event: Action
STONITH-mfsys:0_monitor_60000 (9) confirmed on n502.havc (rc=0)
Jun 8 11:37:16 n501 kernel: [1044136.919946] br0: port 3(vnet1) entering
disabled state
Jun 8 11:37:16 n501 kernel: [1044136.950311] device vnet1 left promiscuous mode
Jun 8 11:37:16 n501 kernel: [1044136.950315] br0: port 3(vnet1) entering
disabled state
Jun 8 11:37:16 n501 libvirtd: 11:37:16.887: error :
qemuMonitorCommandWithHandler:255 : cannot send monitor command 'info balloon':
Connection reset by peer
Jun 8 11:37:16 n501 libvirtd: 11:37:16.887: error :
qemuMonitorTextGetBalloonInfo:555 : operation failed: could not query memory
balloon allocation
Jun 8 11:37:16 n501 lrmd: [3232]: info: RA output:
(VRNSIDU_intranet-VirtualDomain:stop:stderr) error: operation failed: could not
query memory balloon allocation
[BUG] Jun 8 11:37:16 n501 VirtualDomain[28606]: ERROR: Virtual domain
VRNSIDU_intranet has no state during stop operation, bailing out.
[BUG] Jun 8 11:37:16 n501 VirtualDomain[28606]: INFO: Issuing forced shutdown
(destroy) request for domain VRNSIDU_intranet.
[BUG] Jun 8 11:37:16 n501 libvirtd: 11:37:16.899: error :
qemudDomainDestroy:4767 : Requested operation is not valid: domain is not
running
[BUG] Jun 8 11:37:16 n501 lrmd: [3232]: info: RA output:
(VRNSIDU_intranet-VirtualDomain:stop:stderr) error: Failed to destroy domain
VRNSIDU_intranet#012error: Requested operation is not valid: domain is not
running
[BUG] Jun 8 11:37:16 n501 crmd: [3235]: info: process_lrm_event: LRM operation
VRNSIDU_intranet-VirtualDomain_stop_0 (call=248, rc=1, cib-update=1627,
confirmed=true) unknown error
[BUG] Jun 8 11:37:16 n501 crmd: [3235]: WARN: status_from_rc: Action 27
(VRNSIDU_intranet-VirtualDomain_stop_0) on n501.havc failed (target: 0 vs. rc:
1): Error
[BUG] Jun 8 11:37:16 n501 crmd: [3235]: WARN: update_failcount: Updating
failcount for VRNSIDU_intranet-VirtualDomain on n501.havc after failed stop:
rc=1 (update=INFINITY, time=1307525836)
Jun 8 11:37:16 n501 crmd: [3235]: info: abort_transition_graph:
match_graph_event:272 - Triggered transition abort (complete=0, tag=lrm_rsc_op,
id=VRNSIDU_intranet-VirtualDomain_stop_0,
magic=0:1;27:1068:0:a09be5db-e85d-429a-a8f7-2e876b
Jun 8 11:37:16 n501 crmd: [3235]: info: match_graph_event: Action
VRNSIDU_intranet-VirtualDomain_stop_0 (27) confirmed on n501.havc (rc=4)
Jun 8 11:37:16 n501 attrd: [3233]: info: attrd_trigger_update: Sending flush
op to all hosts for: fail-count-VRNSIDU_intranet-VirtualDomain (INFINITY)
Jun 8 11:37:16 n501 crmd: [3235]: info: run_graph:
====================================================
Jun 8 11:37:16 n501 crmd: [3235]: notice: run_graph: Transition 1068
(Complete=18, Pending=0, Fired=0, Skipped=9, Incomplete=0,
Source=/var/lib/pengine/pe-error-5.bz2): Stopped
Jun 8 11:37:16 n501 crmd: [3235]: info: te_graph_trigger: Transition 1068 is
now complete
Jun 8 11:37:16 n501 crmd: [3235]: info: do_state_transition: State transition
S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=notify_crmd ]
Jun 8 11:37:16 n501 crmd: [3235]: info: do_state_transition: All 2 cluster
nodes are eligible to run resources.
Jun 8 11:37:16 n501 crmd: [3235]: info: do_pe_invoke: Query 1628: Requesting
the current CIB: S_POLICY_ENGINE
Jun 8 11:37:16 n501 attrd: [3233]: info: attrd_perform_update: Sent update
513: fail-count-VRNSIDU_intranet-VirtualDomain=INFINITY
Jun 8 11:37:16 n501 attrd: [3233]: info: attrd_trigger_update: Sending flush
op to all hosts for: last-failure-VRNSIDU_intranet-VirtualDomain (1307525836)
Jun 8 11:37:16 n501 attrd: [3233]: info: attrd_perform_update: Sent update
515: last-failure-VRNSIDU_intranet-VirtualDomain=1307525836
Jun 8 11:37:16 n501 crmd: [3235]: info: do_pe_invoke_callback: Invoking the
PE: query=1628, ref=pe_calc-dc-1307525836-1642, seq=40, quorate=1
Jun 8 11:37:16 n501 crmd: [3235]: info: abort_transition_graph:
te_update_diff:146 - Triggered transition abort (complete=1,
tag=transient_attributes, id=n501.havc, magic=NA, cib=0.33.24) : Transient
attribute: update
Jun 8 11:37:16 n501 crmd: [3235]: info: abort_transition_graph:
te_update_diff:146 - Triggered transition abort (complete=1,
tag=transient_attributes, id=n501.havc, magic=NA, cib=0.33.25) : Transient
attribute: update
Jun 8 11:37:16 n501 crmd: [3235]: info: do_pe_invoke: Query 1629: Requesting
the current CIB: S_POLICY_ENGINE
Jun 8 11:37:16 n501 crmd: [3235]: info: do_pe_invoke: Query 1630: Requesting
the current CIB: S_POLICY_ENGINE
Jun 8 11:37:16 n501 pengine: [3234]: notice: unpack_config: On loss of CCM
Quorum: Ignore
Jun 8 11:37:16 n501 pengine: [3234]: info: unpack_config: Node scores: 'red' =
-INFINITY, 'yellow' = 0, 'green' = 0
Jun 8 11:37:16 n501 pengine: [3234]: info: determine_online_status: Node
n501.havc is online
[BUG] Jun 8 11:37:16 n501 pengine: [3234]: WARN: unpack_rsc_op: Processing
failed op VRNSIDU_intranet-VirtualDomain_stop_0 on n501.havc: unknown error (1)
[BUG] Jun 8 11:37:16 n501 pengine: [3234]: WARN: pe_fence_node: Node n501.havc
will be fenced to recover from resource failure(s)
--- VirtualDomain.orig 2011-06-08 13:53:21.000000000 +0200
+++ VirtualDomain 2011-06-08 14:21:27.000000000 +0200
@@ -147,9 +147,12 @@
}
VirtualDomain_Status() {
+ local loop
rc=$OCF_ERR_GENERIC
status="no state"
+ loop=0
while [ "$status" = "no state" ]; do
+ loop=$(( $loop+1 ))
status="`virsh $VIRSH_OPTIONS domstate $DOMAIN_NAME`"
case "$status" in
"shut off")
@@ -173,10 +176,11 @@
# whenever virsh can't reliably obtain the domain
# state.
status="no state"
- if [ "$__OCF_ACTION" = "stop" ]; then
+ if [ "$__OCF_ACTION" = "stop" -a $loop -gt 5 ]; then
# During the stop operation, we want to bail out
# quickly, so as to be able to force-stop (destroy)
- # the domain if necessary.
+ # the domain if necessary (but not after making sure
+ # we don't eventually get a valid state).
ocf_log error "Virtual domain $DOMAIN_NAME has no state
during stop operation, bailing out."
return $OCF_ERR_GENERIC;
else