Hi! Basically I think there should be no hard-coded constants whose value depends on some performance measurement, like 5s for rebooting a VM. So I support Tom's changes.
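For illustration only, the retry loop further down would then take its count from a parameter instead of a literal 5; a minimal sketch (the parameter name wait_for_reboot is taken from Tom's mail, the per-action logging of the real function is left out, and the actual patch may look different):

    # Sketch, not the actual patch: honour a user-settable value instead of
    # the hard-coded "cnt=5"; fall back to 5 if the parameter is not set.
    : ${OCF_RESKEY_wait_for_reboot:=5}

    Xen_Status_with_Retry() {
        local rc cnt=$OCF_RESKEY_wait_for_reboot

        Xen_Status $1
        rc=$?
        while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
            ocf_log debug "domain $1 reported as not running, waiting $cnt seconds ..."
            sleep 1
            Xen_Status $1
            rc=$?
            cnt=$((cnt-1))
        done
        return $rc
    }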
However I noticed:

+running; apparently, this period lasts only for a second or
+two

(missing full stop at the end of the sentence)

Actually I'd rephrase the description: "When the guest is rebooting, there is
a short interval where the guest completely disappears from "xm list", which,
in turn, will cause the monitor operation to return a "not running" status.
If the guest cannot be found, this value will cause some extra delay in the
monitor operation to work around the problem." (I.e. try to describe the
effect, not the implementation.)

And yes, I appreciate consistent log formats also ;-)

Regards,
Ulrich

>>> Tom Parker <tpar...@cbnco.com> wrote on 18.10.2013 at 19:30 in message
<5261703a.5070...@cbnco.com>:
> Hi Dejan. Sorry to be slow to respond to this. I have done some
> testing and everything looks good.
>
> I spent some time tweaking the RA and I added a parameter called
> wait_for_reboot (default 5s) to allow us to override the reboot sleep
> times (in case it's more than 5 seconds on really loaded hypervisors).
> I also cleaned up a few log entries to make them consistent in the RA
> and edited your entries for xen status to be a little bit more clear as
> to why we think we should be waiting.
>
> I have attached a patch here because I have NO idea how to create a
> branch and pull request. If there are links to a good place to start I
> may be able to contribute occasionally to some other RAs that I use.
>
> Please let me know what you think.
>
> Thanks for your help
>
> Tom
>
>
> On 10/17/2013 06:10 AM, Dejan Muhamedagic wrote:
>> On Thu, Oct 17, 2013 at 11:45:17AM +0200, Dejan Muhamedagic wrote:
>>> Hi Tom,
>>>
>>> On Wed, Oct 16, 2013 at 05:28:28PM -0400, Tom Parker wrote:
>>>> Some more reading of the source code makes me think the
>>>> " || [ "$__OCF_ACTION" != "stop" ]; " is not needed.
>>> Yes, you're right. I'll drop that part of the if statement. Many
>>> thanks for testing.
>> Fixed now. The if statement, which was obviously hard to follow,
>> got relegated to the monitor function. Which makes the
>> Xen_Status_with_Retry really stand for what's happening in there ;-)
>>
>> Tom, hope you can test again.
>>
>> Cheers,
>>
>> Dejan
>>
>>> Cheers,
>>>
>>> Dejan
>>>
>>>> Xen_Status_with_Retry() is only called from Stop and Monitor so we only
>>>> need to check if it's a probe. Everything else should be handled in the
>>>> case statement in the loop.
>>>>
>>>> Tom
>>>>
>>>> On 10/16/2013 05:16 PM, Tom Parker wrote:
>>>>> Hi. I think there is an issue with the Updated Xen RA.
>>>>>
>>>>> I think there is an issue with the if statement here but I am not sure.
>>>>> I may be confused about how bash || works but I don't see my servers
>>>>> ever entering the loop on a vm disappearing.
>>>>>
>>>>> if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
>>>>>     return $rc
>>>>> fi
>>>>>
>>>>> Does this not mean that if we run a monitor operation that is not a
>>>>> probe we will have:
>>>>>
>>>>> (ocf_is_probe) return false
>>>>> (stop != monitor) return true
>>>>> (false || true) return true
>>>>>
>>>>> which will cause the if statement to return $rc and never enter the loop?
>>>>>
>>>>> Xen_Status_with_Retry() {
>>>>>     local rc cnt=5
>>>>>
>>>>>     Xen_Status $1
>>>>>     rc=$?
>>>>>     if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
>>>>>         return $rc
>>>>>     fi
>>>>>     while [ $rc -eq $OCF_NOT_RUNNING -a $cnt -gt 0 ]; do
>>>>>         case "$__OCF_ACTION" in
>>>>>         stop)
>>>>>             ocf_log debug "domain $1 reported as not running, waiting $cnt seconds ..."
>>>>>             ;;
>>>>>         monitor)
>>>>>             ocf_log warn "domain $1 reported as not running, but it is expected to be running! Retrying for $cnt seconds ..."
>>>>>             ;;
>>>>>         *) : not reachable
>>>>>             ;;
>>>>>         esac
>>>>>         sleep 1
>>>>>         Xen_Status $1
>>>>>         rc=$?
>>>>>         let cnt=$((cnt-1))
>>>>>     done
>>>>>     return $rc
>>>>> }
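(For a regular monitor, i.e. not a probe, the condition does short-circuit exactly as Tom describes; a minimal stand-alone sketch of the evaluation, with ocf_is_probe stubbed rather than using the real OCF helper:)

    #!/bin/sh
    # Stand-alone demo of the short-circuit above -- not the RA code.
    ocf_is_probe() { return 1; }   # stub: pretend this is a normal op, not a probe
    __OCF_ACTION=monitor

    if ocf_is_probe || [ "$__OCF_ACTION" != "stop" ]; then
        echo "returns early - the retry loop is never reached"
    else
        echo "would fall through into the retry loop"
    fi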
>>>>>
>>>>>
>>>>> On 10/16/2013 12:12 PM, Dejan Muhamedagic wrote:
>>>>>> Hi Tom,
>>>>>>
>>>>>> On Tue, Oct 15, 2013 at 07:55:11PM -0400, Tom Parker wrote:
>>>>>>> Hi Dejan
>>>>>>>
>>>>>>> Just a quick question. I cannot see your new log messages being logged
>>>>>>> to syslog
>>>>>>>
>>>>>>> ocf_log warn "domain $1 reported as not running, but it is expected to
>>>>>>> be running! Retrying for $cnt seconds ..."
>>>>>>>
>>>>>>> Do you know where I can set my logging to see warn level messages? I
>>>>>>> expected to see them in my testing by default but that does not seem to
>>>>>>> be true.
>>>>>> You should see them by default. But note that these warnings may
>>>>>> not happen, depending on the circumstances on your host. In my
>>>>>> experiments they were logged only while the guest was rebooting
>>>>>> and then just once or maybe twice. If you have recent
>>>>>> resource-agents and crmsh, you can enable operation tracing (with
>>>>>> crm resource trace <rsc> monitor <interval>).
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Dejan
>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Tom
>>>>>>>
>>>>>>>
>>>>>>> On 10/08/2013 05:04 PM, Dejan Muhamedagic wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On Tue, Oct 08, 2013 at 01:52:56PM +0200, Ulrich Windl wrote:
>>>>>>>>> Hi!
>>>>>>>>>
>>>>>>>>> I thought, I'll never be bitten by this bug, but I actually was! Now I'm
>>>>>>>>> wondering whether the Xen RA sees the guest if you use pygrub, and pygrub is
>>>>>>>>> still counting down for actual boot...
>>>>>>>>>
>>>>>>>>> But the reason why I'm writing is that I think I've discovered another bug in
>>>>>>>>> the RA:
>>>>>>>>>
>>>>>>>>> CRM decided to "recover" the guest VM "v02":
>>>>>>>>> [...]
>>>>>>>>> lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client 14906:
>>>>>>>>> pid 19516 exited with return code 7
>>>>>>>>> [...]
>>>>>>>>> pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05)
>>>>>>>>> [...]
>>>>>>>>> crmd: [14906]: info: te_rsc_command: Initiating action 5: stop
>>>>>>>>> prm_xen_v02_stop_0 on h05 (local)
>>>>>>>>> [...]
>>>>>>>>> Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
>>>>>>>>> [...]
>>>>>>>>> lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client 14906: pid
>>>>>>>>> 19552 exited with return code 0
>>>>>>>>> [...]
>>>>>>>>> crmd: [14906]: info: te_rsc_command: Initiating action 78: start
>>>>>>>>> prm_xen_v02_start_0 on h05 (local)
>>>>>>>>> lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
>>>>>>>>> [...]
>>>>>>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error: Domain 'v02'
>>>>>>>>> already exists with ID '3'
>>>>>>>>> lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using config file
>>>>>>>>> "/etc/xen/vm/v02".
>>>>>>>>> [...]
>>>>>>>>> lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client 14906: pid
>>>>>>>>> 19686 exited with return code 1
>>>>>>>>> [...]
>>>>>>>>> crmd: [14906]: info: process_lrm_event: LRM operation prm_xen_v02_start_0
>>>>>>>>> (call=32, rc=1, cib-update=5271, confirmed=true) unknown error
>>>>>>>>> crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0) on h05
>>>>>>>>> failed (target: 0 vs. rc: 1): Error
>>>>>>>>> [...]
>>>>>>>>>
>>>>>>>>> As you can clearly see "start" failed, because the guest was found up already!
>>>>>>>>> IMHO this is a bug in the RA (SLES11 SP2: resource-agents-3.9.4-0.26.84).
>>>>>>>> Yes, I've seen that. It's basically the same issue, i.e. the
>>>>>>>> domain being gone for a while and then reappearing.
>>>>>>>>
>>>>>>>>> I guess the following test is problematic:
>>>>>>>>> ---
>>>>>>>>> xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
>>>>>>>>> rc=$?
>>>>>>>>> if [ $rc -ne 0 ]; then
>>>>>>>>>     return $OCF_ERR_GENERIC
>>>>>>>>> ---
>>>>>>>>> Here "xm create" probably fails if the guest is already created...
>>>>>>>> It should fail too. Note that this is a race, but the race is
>>>>>>>> anyway caused by the strange behaviour of xen. With the recent
>>>>>>>> fix (or workaround) in the RA, this shouldn't be happening.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Dejan
>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Ulrich
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> Dejan Muhamedagic <deja...@fastmail.fm> wrote on 01.10.2013 at 12:24 in
>>>>>>>>> message <20131001102430.GA4687@walrus.homenet>:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote:
>>>>>>>>>>> On 2013-10-01T00:53:15, Tom Parker <tpar...@cbnco.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks for paying attention to this issue (not really a bug) as I am
>>>>>>>>>>>> sure I am not the only one with this issue. For now I have set all my
>>>>>>>>>>>> VMs to destroy so that the cluster is the only thing managing them but
>>>>>>>>>>>> this is not super clean as I get failures in my logs that are not really
>>>>>>>>>>>> failures.
>>>>>>>>>>> It is very much a severe bug.
>>>>>>>>>>>
>>>>>>>>>>> The Xen RA has gained a workaround for this now, but we're also pushing
>>>>>>>>>> Take a look here:
>>>>>>>>>>
>>>>>>>>>> https://github.com/ClusterLabs/resource-agents/pull/314
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Dejan
>>>>>>>>>>
>>>>>>>>>>> the Xen team (where the real problem is) to investigate and fix.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Lars
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Architect Storage/HA
>>>>>>>>>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
>>>>>>>>>> HRB 21284 (AG Nürnberg)
>>>>>>>>>>> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems