Hi Ulrich!

On Mon, Oct 21, 2013 at 09:28:50AM +0200, Ulrich Windl wrote:
> Hi!
> 
> Basically I think there should be no hard-coded constants whose value depends
> on some performance measurements, like 5s for rebooting a VM.

It's actually not 5s: the status check is run up to 5 times, with a
one-second sleep between attempts. If the load is high, my guess is
that the Xen tools used by the RA would slow down proportionally.
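For reference, the retry is of roughly this shape (a simplified sketch, not
the exact RA code; `status_with_retry` and the status-function argument are
stand-ins for the real Xen_Status machinery):

```shell
# Simplified sketch (not the exact RA code) of the retry described above:
# the status check is re-run up to 5 times with a 1s sleep in between, so
# the full ~5s delay is only paid while every attempt reports "not running".
status_with_retry() {
    check=$1        # name of a status function: 0 = running, non-zero = not
    cnt=5
    "$check"; rc=$?
    while [ "$rc" -ne 0 ] && [ "$cnt" -gt 0 ]; do
        sleep 1
        "$check"; rc=$?
        cnt=$((cnt-1))
    done
    return "$rc"
}
```

So a guest that reappears on the second or third attempt costs only a second
or two, not the full five.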

> So I support
> Tom's changes.
> 
> However I noticed:
> 
> +running; apparently, this period lasts only for a second or
> +two
> 
> (missing full stop at end of sentence)

That's at the end of the comment and, typically, comments end
with a newline rather than a full stop (as is the case here).

> Actually I'd rephrase the description:
> 
> "When the guest is rebooting, there is a short interval where the guest
> completely disappears from "xm list", which, in turn, will cause the monitor
> operation to return a "not running" status. If the guest cannot be found, this
> value will cause some extra delay in the monitor operation to work around the
> problem."
> 
> (I.e. try to describe the effect, not the implementation)

That's in the code, so it is the implementation that gets
described there. The very top of the comment says:

        # If the guest is rebooting, it may completely disappear from the
        # list of defined guests

I was hoping that that was enough of an explanation. Look for
a more thorough description of the cause in the changelog. BTW,
note that this is a _workaround_ and that the thing should
eventually be fixed in Xen.

> And yes, I appreciate consistent log formats also ;-)

That's always welcome, of course. It should also go in a
separate commit.

Thanks,

Dejan

> Regards,
> Ulrich
> 
> >>> Tom Parker <[email protected]> wrote on 18.10.2013 at 19:30 in message
> <[email protected]>:
> > Hi Dejan.  Sorry to be slow to respond to this.  I have done some
> > testing and everything looks good. 
> > 
> > I spent some time tweaking the RA and I added a parameter called
> > wait_for_reboot (default 5s) to allow us to override the reboot sleep
> > times (in case it's more than 5 seconds on really loaded hypervisors). 
> > I also cleaned up a few log entries to make them consistent in the RA
> > and edited your entries for xen status to be a little bit more clear as
> > to why we think we should be waiting. 
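[For context, OCF RAs conventionally wire up such a parameter with a default
along these lines; this is only a sketch, assuming the wait_for_reboot name
from Tom's description, and his actual patch may handle it differently:]

```shell
# Hypothetical sketch of giving a new OCF parameter its 5s default; the
# wait_for_reboot name comes from Tom's description, not from his patch.
OCF_RESKEY_wait_for_reboot_default=5
: ${OCF_RESKEY_wait_for_reboot=${OCF_RESKEY_wait_for_reboot_default}}
```

The `:` built-in assigns the default only when the administrator has not set
the parameter, so an explicit value always wins.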
> > 
> > I have attached a patch here because I have NO idea how to create a
> > branch and pull request.  If there are links to a good place to start I
> > may be able to contribute occasionally to some other RAs that I use.
> > 
> > Please let me know what you think.
> > 
> > Thanks for your help
> > 
> > Tom
> > 
> > 
> > On 10/17/2013 06:10 AM, Dejan Muhamedagic wrote:
> >> On Thu, Oct 17, 2013 at 11:45:17AM +0200, Dejan Muhamedagic wrote:
> >>> Hi Tom,
> >>>
> >>> On Wed, Oct 16, 2013 at 05:28:28PM -0400, Tom Parker wrote:
> >>>> Some more reading of the source code makes me think the " || [
> >>>> "$__OCF_ACTION" != "stop" ]; " is not needed. 
> >>> Yes, you're right. I'll drop that part of the if statement. Many
> >>> thanks for testing.
> >> Fixed now. The if statement, which was obviously hard to follow,
> >> got relegated to the monitor function, which makes
> >> Xen_Status_with_Retry really stand for what's happening in there ;-)
> >>
> >> Tom, hope you can test again.
> >>
> >> Cheers,
> >>
> >> Dejan
> >>
> >>> Cheers,
> >>>
> >>> Dejan
> >>>
> >>>> Xen_Status_with_Retry() is only called from Stop and Monitor so we only
> >>>> need to check if it's a probe.  Everything else should be handled in the
> >>>> case statement in the loop.
> >>>>
> >>>> Tom
> >>>>
> >>>> On 10/16/2013 05:16 PM, Tom Parker wrote:
> >>>>> Hi.  I think there is an issue with the Updated Xen RA.
> >>>>>
> >>>>> I think there is an issue with the if statement here but I am not sure.
> 
> >>>>> I may be confused about how bash || works but I don't see my servers
> >>>>> ever entering the loop on a vm disappearing.
> >>>>>
> >>>>> if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
> >>>>>         return $rc
> >>>>> fi
> >>>>>
> >>>>> Does this not mean that if we run a monitor operation that is not a
> >>>>> probe we will have:
> >>>>>
> >>>>> (ocf_is_probe) return false
> >>>>> (stop != monitor) return true
> >>>>> (false || true) return true
> >>>>>
> >>>>> which will cause the if statement to return $rc and never enter the loop?
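[The short-circuit in question is easy to reproduce in isolation; this is a
minimal sketch where the stubbed `ocf_is_probe` and hand-set `__OCF_ACTION`
merely model a regular, non-probe monitor operation:]

```shell
# Minimal model of the condition above for a non-probe monitor operation:
# ocf_is_probe is false, but "monitor" != "stop" is true, so the whole
# "a || b" test is true and the retry loop would never be reached.
ocf_is_probe() { false; }     # stub: a regular operation, not a probe
__OCF_ACTION=monitor

entered_loop=no
if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
    result="early return"
else
    entered_loop=yes
    result="retry loop"
fi
echo "$result"
```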
> >>>>>
> >>>>> Xen_Status_with_Retry() {
> >>>>>   local rc cnt=5
> >>>>>
> >>>>>   Xen_Status $1
> >>>>>   rc=$?
> >>>>>   if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
> >>>>>         return $rc
> >>>>>   fi
> >>>>>   while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
> >>>>>         case "$__OCF_ACTION" in
> >>>>>         stop)
> >>>>>           ocf_log debug "domain $1 reported as not running, waiting
> >>>>> $cnt seconds ..."
> >>>>>           ;;
> >>>>>         monitor)
> >>>>>           ocf_log warn "domain $1 reported as not running, but it is
> >>>>> expected to be running! Retrying for $cnt seconds ..."
> >>>>>           ;;
> >>>>>         *) : not reachable
> >>>>>                 ;;
> >>>>>         esac
> >>>>>         sleep 1
> >>>>>         Xen_Status $1
> >>>>>         rc=$?
> >>>>>         let cnt=$((cnt-1))
> >>>>>   done
> >>>>>   return $rc
> >>>>> }
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 10/16/2013 12:12 PM, Dejan Muhamedagic wrote:
> >>>>>> Hi Tom,
> >>>>>>
> >>>>>> On Tue, Oct 15, 2013 at 07:55:11PM -0400, Tom Parker wrote:
> >>>>>>> Hi Dejan
> >>>>>>>
> >>>>>>> Just a quick question.  I cannot see your new log messages being
> >>>>>>> logged to syslog
> >>>>>>>
> >>>>>>> ocf_log warn "domain $1 reported as not running, but it is expected
> >>>>>>> to be running! Retrying for $cnt seconds ..."
> >>>>>>>
> >>>>>>> Do you know where I can set my logging to see warn level messages?  I
> >>>>>>> expected to see them in my testing by default but that does not seem
> >>>>>>> to be true.
> >>>>>> You should see them by default. But note that these warnings may
> >>>>>> not happen, depending on the circumstances on your host. In my
> >>>>>> experiments they were logged only while the guest was rebooting
> >>>>>> and then just once or maybe twice. If you have recent
> >>>>>> resource-agents and crmsh, you can enable operation tracing (with
> >>>>>> crm resource trace <rsc> monitor <interval>).
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Dejan
> >>>>>>
> >>>>>>> Thanks
> >>>>>>>
> >>>>>>> Tom
> >>>>>>>
> >>>>>>>
> >>>>>>> On 10/08/2013 05:04 PM, Dejan Muhamedagic wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> On Tue, Oct 08, 2013 at 01:52:56PM +0200, Ulrich Windl wrote:
> >>>>>>>>> Hi!
> >>>>>>>>>
> >>>>>>>>> I thought I'd never be bitten by this bug, but I actually was! Now
> >>>>>>>>> I'm wondering whether the Xen RA sees the guest if you use pygrub,
> >>>>>>>>> and pygrub is still counting down for the actual boot...
> >>>>>>>>>
> >>>>>>>>> But the reason why I'm writing is that I think I've discovered
> >>>>>>>>> another bug in the RA:
> >>>>>>>>>
> >>>>>>>>> CRM decided to "recover" the guest VM "v02":
> >>>>>>>>> [...]
> >>>>>>>>> lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client
> >>>>>>>>> 14906: pid 19516 exited with return code 7
> >>>>>>>>> [...]
> >>>>>>>>>  pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05)
> >>>>>>>>> [...]
> >>>>>>>>>  crmd: [14906]: info: te_rsc_command: Initiating action 5: stop
> >>>>>>>>> prm_xen_v02_stop_0 on h05 (local)
> >>>>>>>>> [...]
> >>>>>>>>> Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
> >>>>>>>>> [...]
> >>>>>>>>> lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client
> >>>>>>>>> 14906: pid 19552 exited with return code 0
> >>>>>>>>> [...]
> >>>>>>>>> crmd: [14906]: info: te_rsc_command: Initiating action 78: start
> >>>>>>>>> prm_xen_v02_start_0 on h05 (local)
> >>>>>>>>> lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
> >>>>>>>>> [...]
> >>>>>>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error:
> >>>>>>>>> Domain 'v02' already exists with ID '3'
> >>>>>>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using
> >>>>>>>>> config file "/etc/xen/vm/v02".
> >>>>>>>>> [...]
> >>>>>>>>> lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client
> >>>>>>>>> 14906: pid 19686 exited with return code 1
> >>>>>>>>> [...]
> >>>>>>>>> crmd: [14906]: info: process_lrm_event: LRM operation
> >>>>>>>>> prm_xen_v02_start_0 (call=32, rc=1, cib-update=5271, confirmed=true)
> >>>>>>>>> unknown error
> >>>>>>>>> crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0)
> >>>>>>>>> on h05 failed (target: 0 vs. rc: 1): Error
> >>>>>>>>> [...]
> >>>>>>>>>
> >>>>>>>>> As you can clearly see, "start" failed because the guest was
> >>>>>>>>> found up already!
> >>>>>>>>> IMHO this is a bug in the RA (SLES11 SP2: resource-agents-3.9.4-0.26.84).
> >>>>>>>> Yes, I've seen that. It's basically the same issue, i.e. the
> >>>>>>>> domain being gone for a while and then reappearing.
> >>>>>>>>
> >>>>>>>>> I guess the following test is problematic:
> >>>>>>>>> ---
> >>>>>>>>>   xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
> >>>>>>>>>   rc=$?
> >>>>>>>>>   if [ $rc -ne 0 ]; then
> >>>>>>>>>     return $OCF_ERR_GENERIC
> >>>>>>>>> ---
> >>>>>>>>> Here "xm create" probably fails if the guest is already created...
> >>>>>>>> It should fail too. Note that this is a race, but the race is
> >>>>>>>> anyway caused by the strange behaviour of xen. With the recent
> >>>>>>>> fix (or workaround) in the RA, this shouldn't be happening.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Dejan
> >>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Ulrich
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>>> Dejan Muhamedagic <[email protected]> wrote on 01.10.2013 at 12:24 in
> >>>>>>>>> message <[email protected]>:
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote:
> >>>>>>>>>>> On 2013-10-01T00:53:15, Tom Parker <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Thanks for paying attention to this issue (not really a bug), as
> >>>>>>>>>>>> I am sure I am not the only one with it.  For now I have set all
> >>>>>>>>>>>> my VMs to destroy so that the cluster is the only thing managing
> >>>>>>>>>>>> them, but this is not super clean, as I get failures in my logs
> >>>>>>>>>>>> that are not really failures.
> >>>>>>>>>>> It is very much a severe bug.
> >>>>>>>>>>>
> >>>>>>>>>>> The Xen RA has gained a workaround for this now, but we're also
> >>>>>>>>>>> pushing
> >>>>>>>>>> Take a look here:
> >>>>>>>>>>
> >>>>>>>>>> https://github.com/ClusterLabs/resource-agents/pull/314 
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>>
> >>>>>>>>>> Dejan
> >>>>>>>>>>
> >>>>>>>>>>> the Xen team (where the real problem is) to investigate and fix.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>>     Lars
> >>>>>>>>>>>
> >>>>>>>>>>> -- 
> >>>>>>>>>>> Architect Storage/HA
> >>>>>>>>>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix
> >>>>>>>>>>> Imendörffer, HRB 21284 (AG Nürnberg)
> >>>>>>>>>>> "Experience is the name everyone gives to their mistakes."
> >>>>>>>>>>> -- Oscar Wilde
> >>>>>>>>>>>
> >>>>>>>>>>> _______________________________________________
> >>>>>>>>>>> Linux-HA mailing list
> >>>>>>>>>>> [email protected] 
> >>>>>>>>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha 
> >>>>>>>>>>> See also: http://linux-ha.org/ReportingProblems 
> 
> 
