Hi,

On Thu, Oct 10, 2013 at 08:29:04AM -0400, Tom Parker wrote:
> This scares me too.  If the start operation finds a running vm and
> fails, my cluster config will automatically try to start the same VM on
> the next node it has available.  This scenario almost guarantees
> duplicate VMs even if I have on_reboot=destroy set. 
> 
> Dejan,  I am not sure but I don't think your patch will take care of
> this.  In my opinion, a start that finds a running VM should return
> success (the VM should be started, and it is).

The start operation first checks the VM status. It is assumed
that the guest is not rebooting at the very same time the
local resource manager runs the start operation. The chance of
that happening is really low. Otherwise, _all_ start operations
would be penalized with a 5-second delay.
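
For illustration, here is a minimal sketch of what an idempotent start
could look like. The helper names (`domain_running`, `create_domain`,
`SIM_RUNNING`) are hypothetical stubs standing in for the RA's real
`Xen_status` check and `xm create` call, so the control flow can be run
on its own:

```shell
#!/bin/sh
# Sketch: treat "start" on an already-running domain as success instead
# of letting "xm create" fail with "Domain already exists".
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

domain_running() {
    # Hypothetical stub for the RA's status check
    # (something like "xm list $DOMAIN_NAME" in the real RA).
    [ "$SIM_RUNNING" = "yes" ]
}

create_domain() {
    # Hypothetical stub for: xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
    echo "creating domain"
}

xen_start() {
    if domain_running; then
        # The goal of "start" is a running VM; it is already running.
        echo "domain already running"
        return $OCF_SUCCESS
    fi
    create_domain || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

SIM_RUNNING=yes
xen_start    # prints "domain already running" and returns 0
```

Note that this only helps the "found already running" case; it does not
close the reboot race described above, which would still require the
status re-check and the delay it costs.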

Thanks,

Dejan

> Tom
> 
> On 10/08/2013 07:52 AM, Ulrich Windl wrote:
> > Hi!
> >
> > I thought I'd never be bitten by this bug, but I actually was! Now I'm
> > wondering whether the Xen RA sees the guest if you use pygrub, and
> > pygrub is still counting down before the actual boot...
> >
> > But the reason why I'm writing is that I think I've discovered another
> > bug in the RA:
> >
> > CRM decided to "recover" the guest VM "v02":
> > [...]
> > lrmd: [14903]: info: operation monitor[28] on prm_xen_v02 for client 14906:
> > pid 19516 exited with return code 7
> > [...]
> >  pengine: [14905]: notice: LogActions: Recover prm_xen_v02 (Started h05)
> > [...]
> >  crmd: [14906]: info: te_rsc_command: Initiating action 5: stop
> > prm_xen_v02_stop_0 on h05 (local)
> > [...]
> > Xen(prm_xen_v02)[19552]: INFO: Xen domain v02 already stopped.
> > [...]
> > lrmd: [14903]: info: operation stop[31] on prm_xen_v02 for client 14906: pid
> > 19552 exited with return code 0
> > [...]
> > crmd: [14906]: info: te_rsc_command: Initiating action 78: start
> > prm_xen_v02_start_0 on h05 (local)
> > lrmd: [14903]: info: rsc:prm_xen_v02 start[32] (pid 19686)
> > [...]
> > lrmd: [14903]: info: RA output: (prm_xen_v02:start:stderr) Error: Domain
> > 'v02' already exists with ID '3'
> > lrmd: [14903]: info: RA output: (prm_xen_v02:start:stdout) Using config file
> > "/etc/xen/vm/v02".
> > [...]
> > lrmd: [14903]: info: operation start[32] on prm_xen_v02 for client
> > 14906: pid 19686 exited with return code 1
> > [...]
> > crmd: [14906]: info: process_lrm_event: LRM operation prm_xen_v02_start_0
> > (call=32, rc=1, cib-update=5271, confirmed=true) unknown error
> > crmd: [14906]: WARN: status_from_rc: Action 78 (prm_xen_v02_start_0) on h05
> > failed (target: 0 vs. rc: 1): Error
> > [...]
> >
> > As you can clearly see, "start" failed because the guest was already
> > running! IMHO this is a bug in the RA (SLES11 SP2:
> > resource-agents-3.9.4-0.26.84).
> >
> > I guess the following test is problematic:
> > ---
> >   xm create ${OCF_RESKEY_xmfile} name=$DOMAIN_NAME
> >   rc=$?
> >   if [ $rc -ne 0 ]; then
> >     return $OCF_ERR_GENERIC
> > ---
> > Here "xm create" probably fails if the guest is already created...
> >
> > Regards,
> > Ulrich
> >
> >
> >>>> Dejan Muhamedagic <[email protected]> wrote on 01.10.2013 at 12:24
> > in message <[email protected]>:
> >> Hi,
> >>
> >> On Tue, Oct 01, 2013 at 12:13:02PM +0200, Lars Marowsky-Bree wrote:
> >>> On 2013-10-01T00:53:15, Tom Parker <[email protected]> wrote:
> >>>
> >>>> Thanks for paying attention to this issue (not really a bug), as I'm
> >>>> sure I'm not the only one affected.  For now I have set all my VMs to
> >>>> destroy so that the cluster is the only thing managing them, but this
> >>>> is not very clean, as I get failures in my logs that are not real
> >>>> failures.
> >>> It is very much a severe bug.
> >>>
> >>> The Xen RA has gained a workaround for this now, but we're also pushing
> >> Take a look here:
> >>
> >> https://github.com/ClusterLabs/resource-agents/pull/314 
> >>
> >> Thanks,
> >>
> >> Dejan
> >>
> >>> the Xen team (where the real problem is) to investigate and fix.
> >>>
> >>>
> >>> Regards,
> >>>     Lars
> >>>
> >>> -- 
> >>> Architect Storage/HA
> >>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix
> >>> Imendörffer, HRB 21284 (AG Nürnberg)
> >>> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
> >>>
> >>> _______________________________________________
> >>> Linux-HA mailing list
> >>> [email protected] 
> >>> http://lists.linux-ha.org/mailman/listinfo/linux-ha 
> >>> See also: http://linux-ha.org/ReportingProblems 
> >
> 
