Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

Jastrzebski, Michal Thu, 13 Nov 2014 06:35:04 -0800

Guys, I don't think we want to get into this cluster-management mud. You say,
let's make an observer... and what if the observer dies? Do we add an observer
for the observer? And then there is split brain. I'm an observer and I've lost
my connection to a worker. Should I restart the worker? Maybe I'm the one who
lost connection to the rest of the world? Should I resume the task and risk a
duplicate workload?


And then there is another problem. If a timeout is caused by the workers
running out of resources, and we restart the whole workload after the timeout,
we will stretch those resources even further and in turn get more timeouts
(...) <- a great way to kill the whole setup.
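
To make the feedback loop concrete, here is a toy model, not Heat code. It
assumes the engine capacity is shared equally between the active stacks and
that a timed-out stack restarts from zero. All names and numbers are made up
for illustration.

```python
from fractions import Fraction


def simulate(n_stacks, capacity, work, timeout, max_ticks, restart_on_timeout):
    """Toy model: return how many stacks finish within max_ticks.

    capacity -- work units the engines can do per tick, shared equally
    work     -- work units each stack needs
    timeout  -- ticks a stack may run before it is considered timed out
    """
    progress = [Fraction(0)] * n_stacks  # work done so far, per stack
    age = [0] * n_stacks                 # ticks since the last (re)start
    done = [False] * n_stacks

    for _ in range(max_ticks):
        active = [i for i in range(n_stacks) if not done[i]]
        if not active:
            break
        share = Fraction(capacity, len(active))  # equal share this tick
        for i in active:
            progress[i] += share
            age[i] += 1
            if progress[i] >= work:
                done[i] = True
            elif restart_on_timeout and age[i] >= timeout:
                # Restart the whole workload: all progress is thrown away.
                progress[i] = Fraction(0)
                age[i] = 0
    return sum(done)
```

With capacity=12, work=12 and timeout=5, four stacks get 3 units per tick each
and finish in 4 ticks. Six stacks get 2 units per tick, need 6 ticks, hit the
timeout at tick 5 and start over, so in this model none of them ever finishes.
With restart_on_timeout=False the same six stacks finish at tick 6.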

So we get to horizontal scalability. Or the total lack of it. Any stack that is
too complicated for a single engine to process will be impossible to process at
all. We should find a way to distribute workloads in an active-active,
stateless (as much as possible) manner.
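
One way to read "stateless" here, as a rough sketch only: keep the deadline in
shared storage instead of in one engine's memory, and let any engine claim an
expired timeout with a conditional update. The table, column and function
names below are hypothetical, and sqlite3 just stands in for the shared
database.

```python
import sqlite3


def make_store():
    """Create an in-memory stand-in for the shared database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE stack_timeout ("
        " stack_id TEXT PRIMARY KEY,"
        " deadline REAL NOT NULL,"
        " claimed_by TEXT)"
    )
    return db


def register_timeout(db, stack_id, deadline):
    """Persist the deadline so it survives the loss of any single engine."""
    db.execute(
        "INSERT INTO stack_timeout (stack_id, deadline) VALUES (?, ?)",
        (stack_id, deadline),
    )
    db.commit()


def claim_expired(db, engine_id, now):
    """Claim expired, unclaimed timeouts; return the stack ids we won."""
    rows = db.execute(
        "SELECT stack_id FROM stack_timeout"
        " WHERE deadline <= ? AND claimed_by IS NULL",
        (now,),
    ).fetchall()
    claimed = []
    for (stack_id,) in rows:
        # Conditional update: only one engine can flip NULL to its own id.
        cur = db.execute(
            "UPDATE stack_timeout SET claimed_by = ?"
            " WHERE stack_id = ? AND claimed_by IS NULL",
            (engine_id, stack_id),
        )
        if cur.rowcount == 1:
            claimed.append(stack_id)
    db.commit()
    return claimed
```

This only keeps two engines from acting on the same timeout. It does not
answer what happens when the engine that made the claim dies afterwards, which
is exactly the kind of mud I mean.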

Regards,
Michał "inc0" Jastrzębski   

> -----Original Message-----
> From: Murugan, Visnusaran [mailto:visnusaran.muru...@hp.com]
> Sent: Thursday, November 13, 2014 2:59 PM
> To: OpenStack Development Mailing List (not for usage questions)
> Subject: Re: [openstack-dev] [Heat] Using Job Queues for timeout ops
> 
> Zane,
> 
> We do follow shardy's suggestion of running the worker/observer as eventlets
> in heat-engine. No new process. The timer will be executed under an engine's
> worker.
> 
> Question:
> 1. heat-engine processing resource-action failed (process killed)
> 2. heat-engine processing timeout for a stack fails (process killed)
> 
> In the above mentioned cases, I thought celery tasks would come to our
> rescue.
> 
> The Convergence-POC implementation can recover from an error and retry if a
> notification is available.
> 
> 
> -Vishnu
> 
> -----Original Message-----
> From: Zane Bitter [mailto:zbit...@redhat.com]
> Sent: Thursday, November 13, 2014 7:05 PM
> To: openstack-dev@lists.openstack.org
> Subject: Re: [openstack-dev] [Heat] Using Job Queues for timeout ops
> 
> On 13/11/14 06:52, Angus Salkeld wrote:
> > On Thu, Nov 13, 2014 at 6:29 PM, Murugan, Visnusaran
> > <visnusaran.muru...@hp.com> wrote:
> >
> >     Hi all,
> >
> >     Convergence-POC distributes stack operations by sending resource
> >     actions over RPC for any heat-engine to execute. The entire stack
> >     lifecycle will be controlled by worker/observer notifications. This
> >     distributed model has its own advantages and disadvantages.
> >
> >     Any stack operation has a timeout, and a single engine will be
> >     responsible for it. If that engine goes down, the timeout is lost
> >     along with it. So the traditional way is for other engines to
> >     recreate the timeout from scratch. Also, a missed resource action
> >     notification will be detected only when the stack operation timeout
> >     happens.
> >
> >     To overcome this, we will need the following capability:
> >
> >     1. Resource timeout (can be used for retry)
> >
> > We will shortly have a worker job; can't we have a job that just
> > sleeps and gets started in parallel with the job that is doing the work?
> > It gets to the end of the sleep and runs a check.
> 
> What if that worker dies too? There's no guarantee that it'd even be a
> different worker. In fact, there's not even a guarantee that we'd have
> multiple workers.
> 
> BTW Steve Hardy's suggestion, which I have more or less come around to, is
> that the engines themselves should be the workers in convergence, to save
> operators deploying two types of processes. (The observers will still be a
> separate process though, in phase 2.)
> 
> >     2. Recover from engine failure (loss of stack timeout, resource
> >     action notification)
> >
> >
> > My suggestion above could catch failures as long as it was run in a
> > different process.
> >
> > -Angus
> >
> >     Suggestion:
> >
> >     1. Use a task queue like celery to host timeouts for both stack and
> >     resource.
> >
> >     2. Poll the database for engine failures and restart timers /
> >     retrigger resource retry (IMHO: this would be the traditional way,
> >     and it weighs heavy)
> >
> >     3. Migrate heat to use TaskFlow. (Too many code changes)
> >
> >     I am not suggesting we use TaskFlow. Using celery will need only
> >     minimal code changes (decorate the appropriate functions).
> >
> >     Your thoughts?
> >
> >     -Vishnu
> >
> >     IRC: ckmvishnu
> >
> >
> >     _______________________________________________
> >     OpenStack-dev mailing list
> >     OpenStack-dev@lists.openstack.org
> >     <mailto:OpenStack-dev@lists.openstack.org>
> >     http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

