Hi Dejan. Sorry to be slow to respond to this. I have done some
testing and everything looks good.
I spent some time tweaking the RA and added a parameter called
wait_for_reboot (default 5s) that lets us override the reboot sleep
time (in case it takes more than 5 seconds on heavily loaded
hypervisors). I also cleaned up a few log entries in the RA to make
them consistent, and edited your entries for xen status to make it a
little clearer why we think we should be waiting.
I have attached a patch here because I have NO idea how to create a
branch and pull request. If you can point me to a good place to start,
I may be able to contribute occasionally to some other RAs that I use.
Please let me know what you think.
Thanks for your help
Tom
On 10/17/2013 06:10 AM, Dejan Muhamedagic wrote:
> On Thu, Oct 17, 2013 at 11:45:17AM +0200, Dejan Muhamedagic wrote:
>> Hi Tom,
>>
>> On Wed, Oct 16, 2013 at 05:28:28PM -0400, Tom Parker wrote:
>>> Some more reading of the source code makes me think the
>>> '|| [ "$__OCF_ACTION" != "stop" ]' part is not needed.
>> Yes, you're right. I'll drop that part of the if statement. Many
>> thanks for testing.
> Fixed now. The if statement, which was obviously hard to follow,
> got relegated to the monitor function. Which makes the
> Xen_Status_with_Retry really stand for what's happening in there ;-)
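If I read the reshuffle right, it amounts to something like the sketch
below (my own reading of the idea, not the actual commit; the stubs
stand in for ocf_is_probe and the real xm/xen status calls):

```shell
#!/bin/sh
# Sketch of the change as I understand it (not the actual commit):
# the probe check moves out of Xen_Status_with_Retry into the monitor
# path, so the retry helper always retries and callers decide when
# retrying makes sense. Stubs below stand in for the real xm calls.
OCF_NOT_RUNNING=7
ocf_is_probe() { return 1; }            # stub: a regular monitor, not a probe
Xen_Status() { return "$MOCK_STATUS"; } # stub standing in for xm/xen list

Xen_Status_with_Retry() {
    local rc cnt=5
    Xen_Status "$1"; rc=$?
    while [ $rc -eq $OCF_NOT_RUNNING ] && [ $cnt -gt 0 ]; do
        sleep 1
        Xen_Status "$1"; rc=$?
        cnt=$((cnt-1))
    done
    return $rc
}

Xen_Monitor() {
    if ocf_is_probe; then
        Xen_Status "$1"              # probes answer immediately, no waiting
    else
        Xen_Status_with_Retry "$1"   # regular monitors may wait out a reboot
    fi
}

MOCK_STATUS=0
Xen_Monitor guest1 && echo "guest1 running"
```

With that split, the helper's name finally matches what it does.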
>
> Tom, hope you can test again.
>
> Cheers,
>
> Dejan
>
>> Cheers,
>>
>> Dejan
>>
>>> Xen_Status_with_Retry() is only called from Stop and Monitor so we only
>>> need to check if it's a probe. Everything else should be handled in the
>>> case statement in the loop.
>>>
>>> Tom
>>>
>>> On 10/16/2013 05:16 PM, Tom Parker wrote:
>>>> Hi. I think there is an issue with the updated Xen RA.
>>>>
>>>> I suspect the if statement here, but I am not sure. I may be
>>>> confused about how bash || works, but I never see my servers
>>>> entering the loop when a VM disappears.
>>>>
>>>> if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
>>>> return $rc
>>>> fi
>>>>
>>>> Does this not mean that for a monitor operation that is not a
>>>> probe we will have:
>>>>
>>>> (ocf_is_probe)          -> false
>>>> ("monitor" != "stop")   -> true
>>>> (false || true)         -> true
>>>>
>>>> which makes the if statement return $rc and never enter the loop?
>>>>
>>>> Xen_Status_with_Retry() {
>>>> local rc cnt=5
>>>>
>>>> Xen_Status $1
>>>> rc=$?
>>>> if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
>>>> return $rc
>>>> fi
>>>> while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
>>>> case "$__OCF_ACTION" in
>>>> stop)
>>>> ocf_log debug "domain $1 reported as not running, waiting $cnt seconds ..."
>>>> ;;
>>>> monitor)
>>>> ocf_log warn "domain $1 reported as not running, but it is expected to be running! Retrying for $cnt seconds ..."
>>>> ;;
>>>> *) : not reachable
>>>> ;;
>>>> esac
>>>> sleep 1
>>>> Xen_Status $1
>>>> rc=$?
>>>> let cnt=$((cnt-1))
>>>> done
>>>> return $rc
>>>> }
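(That short-circuit is easy to reproduce outside the RA; a minimal
sketch, with a stub standing in for the real ocf_is_probe from the OCF
shell functions:

```shell
#!/bin/sh
# Stub standing in for the real ocf_is_probe: pretend this is a
# regular monitor operation, not a probe.
ocf_is_probe() { return 1; }
__OCF_ACTION="monitor"

# The condition from the RA: ocf_is_probe is false, so the right-hand
# side is evaluated; "monitor" != "stop" is true, so the whole
# condition is true and the retry loop that follows it in the RA is
# never reached.
if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
    echo "early return - retry loop skipped"
fi
```

Running it prints "early return - retry loop skipped", i.e. a
non-probe monitor never reaches the loop.)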
>>>>
>>>>
>>>>
>>>> On 10/16/2013 12:12 PM, Dejan Muhamedagic wrote:
>>>>> Hi Tom,
>>>>>
>>>>> On Tue, Oct 15, 2013 at 07:55:11PM -0400, Tom Parker wrote:
>>>>>> Hi Dejan
>>>>>>
>>>>>> Just a quick question. I cannot see your new log messages being logged
>>>>>> to syslog
>>>>>>
>>>>>> ocf_log warn "domain $1 reported as not running, but it is expected
>>>>>> to be running! Retrying for $cnt seconds ..."
>>>>>>
>>>>>> Do you know where I can set my logging to see warn level messages? I
>>>>>> expected to see them in my testing by default but that does not seem to
>>>>>> be true.
>>>>> You should see them by default. But note that these warnings may
>>>>> not happen, depending on the circumstances on your host. In my
>>>>> experiments they were logged only while the guest was rebooting
>>>>> and then just once or maybe twice. If you have recent
>>>>> resource-agents and crmsh, you can enable operation tracing (with
>>>>> crm resource trace <rsc> monitor <interval>).
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Dejan
>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Tom
>>>>>>
>>>>>>
>>>>>> On 10/08/2013 05:04 PM, Dejan Muhamedagic wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Tue, Oct 08, 2013 at 01:52:56PM +0200, Ulrich Windl wrote:
>>>>>>>> Hi!
>>>>>>>>
>>>>>>>> I thought I'd never be bitten by this bug, but I actually was! Now
>>>>>>>> I'm wondering whether the Xen RA sees the guest if you use pygrub
>>>>>>>> while pygrub is still counting down before the actual boot...
>>>>>>>>
>>>>>>>> But the reason I'm writing is that I think I've discovered
>>>>>>>> another bug in the RA:
>>>>>>>>
>>>>>>>> CRM decided to "recover" the guest VM "v02":
>>>>>>>> [...]
>>>>>>>> lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client 14906: pid 19516 exited with return code 7
>>>>>>>> [...]
>>>>>>>> pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05)
>>>>>>>> [...]
>>>>>>>> crmd: [14906]: info: te_rsc_command: Initiating action 5: stop prm_xen_v02_stop_0 on h05 (local)
>>>>>>>> [...]
>>>>>>>> Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
>>>>>>>> [...]
>>>>>>>> lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client 14906: pid 19552 exited with return code 0
>>>>>>>> [...]
>>>>>>>> crmd: [14906]: info: te_rsc_command: Initiating action 78: start prm_xen_v02_start_0 on h05 (local)
>>>>>>>> lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
>>>>>>>> [...]
>>>>>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error: Domain 'v02' already exists with ID '3'
>>>>>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using config file "/etc/xen/vm/v02".
>>>>>>>> [...]
>>>>>>>> lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client 14906: pid 19686 exited with return code 1
>>>>>>>> [...]
>>>>>>>> crmd: [14906]: info: process_lrm_event: LRM operation prm_xen_v02_start_0 (call=32, rc=1, cib-update=5271, confirmed=true) unknown error
>>>>>>>> crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0) on h05 failed (target: 0 vs. rc: 1): Error
>>>>>>>> [...]
>>>>>>>>
>>>>>>>> As you can clearly see, "start" failed because the guest was already up!
>>>>>>>> IMHO this is a bug in the RA (SLES11 SP2: resource-agents-3.9.4-0.26.84).
>>>>>>> Yes, I've seen that. It's basically the same issue, i.e. the
>>>>>>> domain being gone for a while and then reappearing.
>>>>>>>
>>>>>>>> I guess the following test is problematic:
>>>>>>>> ---
>>>>>>>> xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
>>>>>>>> rc=$?
>>>>>>>> if [ $rc -ne 0 ]; then
>>>>>>>>     return $OCF_ERR_GENERIC
>>>>>>>> ---
>>>>>>>> Here "xm create" probably fails if the guest is already created...
>>>>>>> It should fail too. Note that this is a race, but the race is
>>>>>>> anyway caused by the strange behaviour of xen. With the recent
>>>>>>> fix (or workaround) in the RA, this shouldn't be happening.
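For what it's worth, my understanding of the guarded start is roughly
the following (a sketch under my own assumptions, with stubs in place
of xm create and xm list; not the RA source):

```shell
#!/bin/sh
# Sketch of a start guarded against the "Domain already exists" race
# (my assumption about the shape of the workaround, not the RA source).
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
Xen_Status() { [ "$DOMAIN_UP" = "yes" ]; }   # stub standing in for xm/xen list
xm_create() {                                # stub: xm create fails if the domain exists
    [ "$DOMAIN_UP" != "yes" ] && DOMAIN_UP=yes
}

Xen_Start() {
    if Xen_Status; then
        # The guest reappeared on its own (e.g. mid-reboot): treat
        # start as a no-op instead of letting xm create fail.
        return $OCF_SUCCESS
    fi
    xm_create || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

DOMAIN_UP=no
Xen_Start && echo "started"
```

The point being that the status check absorbs the race, so a guest
that reappears never reaches the failing xm create.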
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Dejan
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Ulrich
>>>>>>>>
>>>>>>>>
>>>>>>>>>>> Dejan Muhamedagic <[email protected]> wrote on 01.10.2013 at
>>>>>>>>>>> 12:24 in message <[email protected]>:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote:
>>>>>>>>>> On 2013-10-01T00:53:15, Tom Parker <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for paying attention to this issue (not really a bug), as I
>>>>>>>>>>> am sure I am not the only one with it. For now I have set all my
>>>>>>>>>>> VMs to destroy so that the cluster is the only thing managing them,
>>>>>>>>>>> but this is not super clean, as I get failures in my logs that are
>>>>>>>>>>> not really failures.
>>>>>>>>>> It is very much a severe bug.
>>>>>>>>>>
>>>>>>>>>> The Xen RA has gained a workaround for this now, but we're also
>>>>>>>>>> pushing
>>>>>>>>> Take a look here:
>>>>>>>>>
>>>>>>>>> https://github.com/ClusterLabs/resource-agents/pull/314
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Dejan
>>>>>>>>>
>>>>>>>>>> the Xen team (where the real problem is) to investigate and fix.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Lars
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Architect Storage/HA
>>>>>>>>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild,
>>>>>>>>>> Felix Imendörffer, HRB 21284 (AG Nürnberg)
>>>>>>>>>> "Experience is the name everyone gives to their mistakes." -- Oscar
>>>>>>>>>> Wilde
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Linux-HA mailing list
>>>>>>>>>> [email protected]
>>>>>>>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>>>>>>>>> See also: http://linux-ha.org/ReportingProblems
--- Xen.Dejan 2013-10-17 22:03:05.000000000 -0400
+++ Xen 2013-10-18 13:25:06.000000000 -0400
@@ -20,6 +20,12 @@
# of virtual machine
# OCF_RESKEY_reserved_Dom0_memory
# minimum memory reserved for domain 0
+# OCF_RESKEY_shutdown_acpi
+# Allow xen to send ACPI power events to the domU
+# for hvm guests.
+# OCF_RESKEY_wait_for_reboot
+# How long to wait for a machine to reboot before
+# declaring it really dead. Default is 5 seconds.
# OCF_RESKEY_monitor_scripts
# scripts to monitor services within the
# virtual domain
@@ -44,6 +50,7 @@
: ${OCF_RESKEY_shutdown_acpi=0}
: ${OCF_RESKEY_allow_mem_management=0}
: ${OCF_RESKEY_reserved_Dom0_memory=512}
+: ${OCF_RESKEY_wait_for_reboot=5}
meta_data() {
cat <<END
@@ -163,6 +170,25 @@
<content type="string" default="" />
</parameter>
+<parameter name="wait_for_reboot" unique="0" required="0">
+<content type="string" default="5" />
+<shortdesc lang="en">
+How long to wait for a guest to reboot
+</shortdesc>
+<longdesc lang="en">
+If the guest is rebooting, it may disappear completely from the
+list of defined guests, in which case xm/xen list reports it as
+not running; this period apparently lasts only for a second or
+two.
+
+If a monitor finds the guest not running, the status check is
+retried for wait_for_reboot seconds (perhaps it will show up).
+
+NOTE: This timer increases the amount of time the cluster will
+wait before declaring a VM dead and recovering it.
+</longdesc>
+</parameter>
+
</parameters>
<actions>
@@ -216,17 +242,18 @@
# If a status returns not running, then test status
# again for 5 times (perhaps it'll show up)
Xen_Status_with_Retry() {
- local rc cnt=5
+ local rc
+ local cnt=$OCF_RESKEY_wait_for_reboot
Xen_Status $1
rc=$?
while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
case "$__OCF_ACTION" in
stop)
- ocf_log debug "domain $1 reported as not running, waiting $cnt seconds ..."
+ ocf_log debug "Xen domain $1 appears to be stopped but may be rebooting. Waiting $cnt seconds ..."
;;
monitor)
- ocf_log warn "domain $1 reported as not running, but it is expected to be running! Retrying for $cnt seconds ..."
+ ocf_log warn "Xen domain $1 appears to be stopped but may be rebooting. Retrying for $cnt seconds ..."
;;
*) : not reachable
;;
@@ -255,7 +282,7 @@
for DOM in ${RUNNING}; do
xm mem-set ${DOM} ${NEWMEM}
done
- ocf_log info "Adjusted memory to: $NEWMEM, for the following $RUNCNT domains: $RUNNING"
+ ocf_log info "Adjusted memory to: $NEWMEM for the following $RUNCNT domains: $RUNNING"
fi
}
@@ -356,7 +383,7 @@
fi
while Xen_Status $dom && [ "$timeout" -gt 0 ]; do
- ocf_log debug "$dom still not stopped. Waiting..."
+ ocf_log debug "Xen domain $dom still not stopped. Waiting..."
timeout=$((timeout-1))
sleep 1
done
@@ -406,13 +433,13 @@
target_addr="$target_node"
if Xen_Status ${DOMAIN_NAME}; then
- ocf_log info "$DOMAIN_NAME: Starting xm migrate to $target_node"
+ ocf_log info "Xen domain $DOMAIN_NAME: Starting xm migrate to $target_node"
if [ -n "$target_attr" ]; then
nodevalue=`crm_attribute --type nodes --node-uname $target_node --attr-name $target_attr --get-value -q`
if [ -n "${nodevalue}" -a "${nodevalue}" != "(null)" ]; then
target_addr="$nodevalue"
- ocf_log info "$DOMAIN_NAME: $target_node is using address $target_addr"
+ ocf_log info "Xen domain $DOMAIN_NAME: $target_node is using address $target_addr"
fi
fi
@@ -420,15 +447,15 @@
rc=$?
if [ $rc -ne 0 ]; then
- ocf_log err "$DOMAIN_NAME: xm migrate to $target_node failed: $rc"
+ ocf_log err "Xen domain $DOMAIN_NAME: xm migrate to $target_node failed: $rc"
return $OCF_ERR_GENERIC
else
Xen_Adjust_Memory 0
- ocf_log info "$DOMAIN_NAME: xm migrate to $target_node succeeded."
+ ocf_log info "Xen domain $DOMAIN_NAME: xm migrate to $target_node succeeded."
return $OCF_SUCCESS
fi
else
- ocf_log err "$DOMAIN_NAME: migrate_to: Not active locally!"
+ ocf_log err "Xen domain $DOMAIN_NAME: migrate_to: Not active locally!"
return $OCF_ERR_GENERIC
fi
}
@@ -443,17 +470,17 @@
fi
while ! Xen_Status ${DOMAIN_NAME} && [ $timeout -gt 0 ]; do
- ocf_log debug "$DOMAIN_NAME: Not yet active locally, waiting (timeout: ${timeout}s)"
+ ocf_log debug "Xen domain $DOMAIN_NAME: Not yet active locally, waiting (timeout: ${timeout}s)"
timeout=$((timeout-1))
sleep 1
done
if Xen_Status ${DOMAIN_NAME}; then
Xen_Adjust_Memory 0
- ocf_log info "$DOMAIN_NAME: Active locally, migration successful"
+ ocf_log info "Xen domain $DOMAIN_NAME: Active locally, migration successful"
return $OCF_SUCCESS
else
- ocf_log err "$DOMAIN_NAME: Not active locally, migration failed!"
+ ocf_log err "Xen domain $DOMAIN_NAME: Not active locally, migration failed!"
return $OCF_ERR_GENERIC
fi
}