Hi Dejan. Sorry to be slow to respond to this. I have done some
testing and everything looks good.
I spent some time tweaking the RA and added a parameter called
wait_for_reboot (default 5s) that lets us override the reboot sleep
time (in case it takes more than 5 seconds on heavily loaded
hypervisors). I also cleaned up a few log entries in the RA to make
them consistent, and edited your entries for xen status to make it a
little clearer why we think we should be waiting.
I have attached a patch here because I have NO idea how to create a
branch and pull request. If you can point me to a good place to start,
I may be able to contribute occasionally to some other RAs that I use.
Please let me know what you think.
Thanks for your help
Tom
On 10/17/2013 06:10 AM, Dejan Muhamedagic wrote:
> On Thu, Oct 17, 2013 at 11:45:17AM +0200, Dejan Muhamedagic wrote:
>> Hi Tom,
>>
>> On Wed, Oct 16, 2013 at 05:28:28PM -0400, Tom Parker wrote:
>>> Some more reading of the source code makes me think the
>>> '|| [ "$__OCF_ACTION" != "stop" ]' part is not needed.
>> Yes, you're right. I'll drop that part of the if statement. Many
>> thanks for testing.
> Fixed now. The if statement, which was obviously hard to follow,
> got relegated to the monitor function. Which makes the
> Xen_Status_with_Retry really stand for what's happening in there ;-)
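If I read the reshuffle right, it amounts to something like the sketch
below (my own reading of the idea, not the actual commit; the stubs
stand in for ocf_is_probe and the real xm/xen status calls):

```shell
#!/bin/sh
# Sketch of the change as I understand it (not the actual commit):
# the probe check moves out of Xen_Status_with_Retry into the monitor
# path, so the retry helper always retries and callers decide when
# retrying makes sense. Stubs below stand in for the real xm calls.
OCF_NOT_RUNNING=7
ocf_is_probe() { return 1; }            # stub: a regular monitor, not a probe
Xen_Status() { return "$MOCK_STATUS"; } # stub standing in for xm/xen list

Xen_Status_with_Retry() {
    local rc cnt=5
    Xen_Status "$1"; rc=$?
    while [ $rc -eq $OCF_NOT_RUNNING ] && [ $cnt -gt 0 ]; do
        sleep 1
        Xen_Status "$1"; rc=$?
        cnt=$((cnt-1))
    done
    return $rc
}

Xen_Monitor() {
    if ocf_is_probe; then
        Xen_Status "$1"              # probes answer immediately, no waiting
    else
        Xen_Status_with_Retry "$1"   # regular monitors may wait out a reboot
    fi
}

MOCK_STATUS=0
Xen_Monitor guest1 && echo "guest1 running"
```

With that split, the helper's name finally matches what it does.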
>
> Tom, hope you can test again.
>
> Cheers,
>
> Dejan
>
>> Cheers,
>>
>> Dejan
>>
>>> Xen_Status_with_Retry() is only called from Stop and Monitor so we only
>>> need to check if it's a probe. Everything else should be handled in the
>>> case statement in the loop.
>>>
>>> Tom
>>>
>>> On 10/16/2013 05:16 PM, Tom Parker wrote:
>>>> Hi. I think there is an issue with the updated Xen RA.
>>>>
>>>> I suspect the if statement here, but I am not sure. I may be
>>>> confused about how bash || works, but I never see my servers
>>>> entering the loop when a VM disappears.
>>>>
>>>> if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
>>>> return $rc
>>>> fi
>>>>
>>>> Does this not mean that for a monitor operation that is not a
>>>> probe we will have:
>>>>
>>>> (ocf_is_probe)          -> false
>>>> ("monitor" != "stop")   -> true
>>>> (false || true)         -> true
>>>>
>>>> which makes the if statement return $rc and never enter the loop?
>>>>
>>>> Xen_Status_with_Retry() {
>>>> local rc cnt=5
>>>>
>>>> Xen_Status $1
>>>> rc=$?
>>>> if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
>>>> return $rc
>>>> fi
>>>> while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
>>>> case "$__OCF_ACTION" in
>>>> stop)
>>>> ocf_log debug "domain $1 reported as not running, waiting $cnt seconds ..."
>>>> ;;
>>>> monitor)
>>>> ocf_log warn "domain $1 reported as not running, but it is expected to be running! Retrying for $cnt seconds ..."
>>>> ;;
>>>> *) : not reachable
>>>> ;;
>>>> esac
>>>> sleep 1
>>>> Xen_Status $1
>>>> rc=$?
>>>> let cnt=$((cnt-1))
>>>> done
>>>> return $rc
>>>> }
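(That short-circuit is easy to reproduce outside the RA; a minimal
sketch, with a stub standing in for the real ocf_is_probe from the OCF
shell functions:

```shell
#!/bin/sh
# Stub standing in for the real ocf_is_probe: pretend this is a
# regular monitor operation, not a probe.
ocf_is_probe() { return 1; }
__OCF_ACTION="monitor"

# The condition from the RA: ocf_is_probe is false, so the right-hand
# side is evaluated; "monitor" != "stop" is true, so the whole
# condition is true and the retry loop that follows it in the RA is
# never reached.
if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
    echo "early return - retry loop skipped"
fi
```

Running it prints "early return - retry loop skipped", i.e. a
non-probe monitor never reaches the loop.)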
>>>>
>>>>
>>>>
>>>> On 10/16/2013 12:12 PM, Dejan Muhamedagic wrote:
>>>>> Hi Tom,
>>>>>
>>>>> On Tue, Oct 15, 2013 at 07:55:11PM -0400, Tom Parker wrote:
>>>>>> Hi Dejan
>>>>>>
>>>>>> Just a quick question. I cannot see your new log messages being logged
>>>>>> to syslog
>>>>>>
>>>>>> ocf_log warn "domain $1 reported as not running, but it is expected
>>>>>> to be running! Retrying for $cnt seconds ..."
>>>>>>
>>>>>> Do you know where I can set my logging to see warn level messages? I
>>>>>> expected to see them in my testing by default but that does not seem to
>>>>>> be true.
>>>>> You should see them by default. But note that these warnings may
>>>>> not happen, depending on the circumstances on your host. In my
>>>>> experiments they were logged only while the guest was rebooting
>>>>> and then just once or maybe twice. If you have recent
>>>>> resource-agents and crmsh, you can enable operation tracing (with
>>>>> crm resource trace <rsc> monitor <interval>).
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Dejan
>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Tom
>>>>>>
>>>>>>
>>>>>> On 10/08/2013 05:04 PM, Dejan Muhamedagic wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Tue, Oct 08, 2013 at 01:52:56PM +0200, Ulrich Windl wrote:
>>>>>>>> Hi!
>>>>>>>>
>>>>>>>> I thought I'd never be bitten by this bug, but I actually was! Now
>>>>>>>> I'm wondering whether the Xen RA sees the guest if you use pygrub
>>>>>>>> while pygrub is still counting down before the actual boot...
>>>>>>>>
>>>>>>>> But the reason I'm writing is that I think I've discovered
>>>>>>>> another bug in the RA:
>>>>>>>>
>>>>>>>> CRM decided to "recover" the guest VM "v02":
>>>>>>>> [...]
>>>>>>>> lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client 14906: pid 19516 exited with return code 7
>>>>>>>> [...]
>>>>>>>> pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05)
>>>>>>>> [...]
>>>>>>>> crmd: [14906]: info: te_rsc_command: Initiating action 5: stop prm_xen_v02_stop_0 on h05 (local)
>>>>>>>> [...]
>>>>>>>> Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
>>>>>>>> [...]
>>>>>>>> lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client 14906: pid 19552 exited with return code 0
>>>>>>>> [...]
>>>>>>>> crmd: [14906]: info: te_rsc_command: Initiating action 78: start prm_xen_v02_start_0 on h05 (local)
>>>>>>>> lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
>>>>>>>> [...]
>>>>>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error: Domain 'v02' already exists with ID '3'
>>>>>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using config file "/etc/xen/vm/v02".
>>>>>>>> [...]
>>>>>>>> lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client 14906: pid 19686 exited with return code 1
>>>>>>>> [...]
>>>>>>>> crmd: [14906]: info: process_lrm_event: LRM operation prm_xen_v02_start_0 (call=32, rc=1, cib-update=5271, confirmed=true) unknown error
>>>>>>>> crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0) on h05 failed (target: 0 vs. rc: 1): Error
>>>>>>>> [...]
>>>>>>>>
>>>>>>>> As you can clearly see, "start" failed because the guest was already up!
>>>>>>>> IMHO this is a bug in the RA (SLES11 SP2: resource-agents-3.9.4-0.26.84).
>>>>>>> Yes, I've seen that. It's basically the same issue, i.e. the
>>>>>>> domain being gone for a while and then reappearing.
>>>>>>>
>>>>>>>> I guess the following test is problematic:
>>>>>>>> ---
>>>>>>>> xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
>>>>>>>> rc=$?
>>>>>>>> if [ $rc -ne 0 ]; then
>>>>>>>>     return $OCF_ERR_GENERIC
>>>>>>>> ---
>>>>>>>> Here "xm create" probably fails if the guest is already created...
>>>>>>> It should fail too. Note that this is a race, but the race is
>>>>>>> anyway caused by the strange behaviour of xen. With the recent
>>>>>>> fix (or workaround) in the RA, this shouldn't be happening.
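For what it's worth, my understanding of the guarded start is roughly
the following (a sketch under my own assumptions, with stubs in place
of xm create and xm list; not the RA source):

```shell
#!/bin/sh
# Sketch of a start guarded against the "Domain already exists" race
# (my assumption about the shape of the workaround, not the RA source).
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
Xen_Status() { [ "$DOMAIN_UP" = "yes" ]; }   # stub standing in for xm/xen list
xm_create() {                                # stub: xm create fails if the domain exists
    [ "$DOMAIN_UP" != "yes" ] && DOMAIN_UP=yes
}

Xen_Start() {
    if Xen_Status; then
        # The guest reappeared on its own (e.g. mid-reboot): treat
        # start as a no-op instead of letting xm create fail.
        return $OCF_SUCCESS
    fi
    xm_create || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

DOMAIN_UP=no
Xen_Start && echo "started"
```

The point being that the status check absorbs the race, so a guest
that reappears never reaches the failing xm create.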
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Dejan
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Ulrich
>>>>>>>>
>>>>>>>>
>>>>>>>>>>> Dejan Muhamedagic <[email protected]> wrote on 01.10.2013 at
>>>>>>>>>>> 12:24 in message <[email protected]>:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote:
>>>>>>>>>> On 2013-10-01T00:53:15, Tom Parker <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for paying attention to this issue (not really a bug), as I
>>>>>>>>>>> am sure I am not the only one with it. For now I have set all my
>>>>>>>>>>> VMs to destroy so that the cluster is the only thing managing them,
>>>>>>>>>>> but this is not super clean, as I get failures in my logs that are
>>>>>>>>>>> not really failures.
>>>>>>>>>> It is very much a severe bug.
>>>>>>>>>>
>>>>>>>>>> The Xen RA has gained a workaround for this now, but we're also
>>>>>>>>>> pushing
>>>>>>>>> Take a look here:
>>>>>>>>>
>>>>>>>>> https://github.com/ClusterLabs/resource-agents/pull/314
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Dejan
>>>>>>>>>
>>>>>>>>>> the Xen team (where the real problem is) to investigate and fix.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Lars
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Architect Storage/HA
>>>>>>>>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild,
>>>>>>>>>> Felix Imendörffer, HRB 21284 (AG Nürnberg)
>>>>>>>>>> "Experience is the name everyone gives to their mistakes." -- Oscar
>>>>>>>>>> Wilde
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Linux-HA mailing list
>>>>>>>>>> [email protected]
>>>>>>>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>>>>>>>>> See also: http://linux-ha.org/ReportingProblems
--- Xen.Dejan 2013-10-17 22:03:05.000000000 -0400
+++ Xen 2013-10-18 13:25:06.000000000 -0400
@@ -20,6 +20,12 @@
# of virtual machine
# OCF_RESKEY_reserved_Dom0_memory
# minimum memory reserved for domain 0
+# OCF_RESKEY_shutdown_acpi
+# Allow xen to send ACPI power events to the domU
+# for hvm guests.
+# OCF_RESKEY_wait_for_reboot
+# How long to wait for a machine to reboot before
+# declaring it really dead. Default is 5 seconds.
# OCF_RESKEY_monitor_scripts
# scripts to monitor services within the
# virtual domain
@@ -44,6 +50,7 @@
: ${OCF_RESKEY_shutdown_acpi=0}
: ${OCF_RESKEY_allow_mem_management=0}
: ${OCF_RESKEY_reserved_Dom0_memory=512}
+: ${OCF_RESKEY_wait_for_reboot=5}
meta_data() {
cat <<END
@@ -163,6 +170,25 @@
<content type="string" default="" />
</parameter>
+<parameter name="wait_for_reboot" unique="0" required="0">
+<content type="string" default="5" />
+<shortdesc lang="en">
+How long to wait for a guest to reboot
+</shortdesc>
+<longdesc lang="en">
+If the guest is rebooting, it may disappear completely from the
+list of defined guests, in which case xm/xen list reports it as
+not running; this period apparently lasts only for a second or
+two.
+
+If a monitor finds the guest not running, the status check is
+retried for wait_for_reboot seconds (perhaps it will show up).
+
+NOTE: This timer increases the amount of time the cluster will
+wait before declaring a VM dead and recovering it.
+</longdesc>
+</parameter>
+
</parameters>
<actions>
@@ -216,17 +242,18 @@
# If a status returns not running, then test status
# again for 5 times (perhaps it'll show up)
Xen_Status_with_Retry() {
- local rc cnt=5
+ local rc
+ local cnt=$OCF_RESKEY_wait_for_reboot
Xen_Status $1
rc=$?
while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
case "$__OCF_ACTION" in
stop)
- ocf_log debug "domain $1 reported as not running, waiting $cnt seconds ..."
+ ocf_log debug "Xen domain $1 appears to be stopped but may be rebooting. Waiting $cnt seconds ..."
;;
monitor)
- ocf_log warn "domain $1 reported as not running, but it is expected to be running! Retrying for $cnt seconds ..."
+ ocf_log warn "Xen domain $1 appears to be stopped but may be rebooting. Retrying for $cnt seconds ..."
;;
*) : not reachable
;;
@@ -255,7 +282,7 @@
for DOM in ${RUNNING}; do
xm mem-set ${DOM} ${NEWMEM}
done
- ocf_log info "Adjusted memory to: $NEWMEM, for the following $RUNCNT domains: $RUNNING"
+ ocf_log info "Adjusted memory to: $NEWMEM for the following $RUNCNT domains: $RUNNING"
fi
}
@@ -356,7 +383,7 @@
fi
while Xen_Status $dom && [ "$timeout" -gt 0 ]; do
- ocf_log debug "$dom still not stopped. Waiting..."
+ ocf_log debug "Xen domain $dom still not stopped. Waiting..."
timeout=$((timeout-1))
sleep 1
done
@@ -406,13 +433,13 @@
target_addr="$target_node"
if Xen_Status ${DOMAIN_NAME}; then
- ocf_log info "$DOMAIN_NAME: Starting xm migrate to $target_node"
+ ocf_log info "Xen domain $DOMAIN_NAME: Starting xm migrate to $target_node"
if [ -n "$target_attr" ]; then
nodevalue=`crm_attribute --type nodes --node-uname $target_node --attr-name $target_attr --get-value -q`
if [ -n "${nodevalue}" -a "${nodevalue}" != "(null)" ]; then
target_addr="$nodevalue"
- ocf_log info "$DOMAIN_NAME: $target_node is using address $target_addr"
+ ocf_log info "Xen domain $DOMAIN_NAME: $target_node is using address $target_addr"
fi
fi
@@ -420,15 +447,15 @@
rc=$?
if [ $rc -ne 0 ]; then
- ocf_log err "$DOMAIN_NAME: xm migrate to $target_node failed: $rc"
+ ocf_log err "Xen domain $DOMAIN_NAME: xm migrate to $target_node failed: $rc"
return $OCF_ERR_GENERIC
else
Xen_Adjust_Memory 0
- ocf_log info "$DOMAIN_NAME: xm migrate to $target_node succeeded."
+ ocf_log info "Xen domain $DOMAIN_NAME: xm migrate to $target_node succeeded."
return $OCF_SUCCESS
fi
else
- ocf_log err "$DOMAIN_NAME: migrate_to: Not active locally!"
+ ocf_log err "Xen domain $DOMAIN_NAME: migrate_to: Not active locally!"
return $OCF_ERR_GENERIC
fi
}
@@ -443,17 +470,17 @@
fi
while ! Xen_Status ${DOMAIN_NAME} && [ $timeout -gt 0 ]; do
- ocf_log debug "$DOMAIN_NAME: Not yet active locally, waiting (timeout: ${timeout}s)"
+ ocf_log debug "Xen domain $DOMAIN_NAME: Not yet active locally, waiting (timeout: ${timeout}s)"
timeout=$((timeout-1))
sleep 1
done
if Xen_Status ${DOMAIN_NAME}; then
Xen_Adjust_Memory 0
- ocf_log info "$DOMAIN_NAME: Active locally, migration successful"
+ ocf_log info "Xen domain $DOMAIN_NAME: Active locally, migration successful"
return $OCF_SUCCESS
else
- ocf_log err "$DOMAIN_NAME: Not active locally, migration failed!"
+ ocf_log err "Xen domain $DOMAIN_NAME: Not active locally, migration failed!"
return $OCF_ERR_GENERIC
fi
}