The API change is basically what I had imagined, +1. I had pictured a
scale-down command as a wrapper around job kill, though another approach
would be to augment the current job kill command to kill the last N
instances of a job, e.g.:

    aurora job kill devcluster/vagrant/test/hello 10
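A minimal sketch of what such a client-side wrapper could look like
(purely illustrative: the kill_last_n helper is hypothetical, and it
assumes the client's job kill accepts an explicit instance list appended
to the job path, e.g. .../8,9):

    import subprocess

    def kill_last_n(job_path, active_instance_ids, n):
        """Kill the last n instances of an Aurora job (illustrative wrapper).

        job_path: cluster/role/env/name, e.g. 'devcluster/vagrant/test/hello'.
        active_instance_ids: the job's current instance IDs; the caller is
        expected to fetch these (e.g. from `aurora job status`), since IDs
        may be sparse after earlier kills.
        """
        victims = sorted(active_instance_ids)[-n:]
        # Append the explicit instance list to the job path, e.g.
        # devcluster/vagrant/test/hello/8,9, and delegate to job kill.
        spec = '%s/%s' % (job_path, ','.join(str(i) for i in victims))
        subprocess.check_call(['aurora', 'job', 'kill', spec])

    # e.g.: kill_last_n('devcluster/vagrant/test/hello', range(10), 2)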
On Fri, Jan 15, 2016 at 10:06 AM, Maxim Khutornenko <ma...@apache.org> wrote:
> I wasn't planning on using the rolling updater functionality given the
> simplicity of the operation. I'd second Steve's earlier concerns about
> scaleOut() looking more like startJobUpdate() if we keep adding
> features. If health watching, throttling (batch_size) or rollback on
> failure is required, then I believe startJobUpdate() should be used
> instead of scaleOut().
>
> On Fri, Jan 15, 2016 at 1:09 AM, Erb, Stephan
> <stephan....@blue-yonder.com> wrote:
> > I really like the proposal. The gain in simplicity on the client side
> > from not having to provide an aurora config is quite significant.
> >
> > The implementation on the scheduler side is probably rather
> > straightforward, as the updater can be reused. That would also provide
> > us with the update UI, which has proven quite useful when tracing
> > autoscaler events.
> >
> > Regards,
> > Stephan
> > ________________________________________
> > From: Maxim Khutornenko <ma...@apache.org>
> > Sent: Thursday, January 14, 2016 9:50 PM
> > To: dev@aurora.apache.org
> > Subject: Re: [PROPOSAL] Job instance scaling APIs
> >
> > "I'd be concerned that any scaling API powerful enough to fit all
> > (most) use cases would just end up looking like the update API."
> >
> > There is a big difference between the scaleOut and startJobUpdate APIs
> > that justifies the inclusion of the former. Namely, scaleOut may only
> > replicate existing instances, without changing or introducing any new
> > scheduling requirements and without performing instance
> > rollout/rollback. I don't see scaleOut ever becoming powerful enough
> > to threaten startJobUpdate. At the same time, dropping the aurora
> > config requirement is a huge boost to autoscaling client simplicity.
> >
> > "For example, when scaling down we don't just kill the last N
> > instances, we actually look at the least loaded hosts (globally) and
> > kill tasks from those."
> >
> > I don't quite see why the same wouldn't be possible with a scaleIn
> > API. Isn't it always the external process's responsibility to pay due
> > diligence before killing instances?
> >
> > On Thu, Jan 14, 2016 at 12:35 PM, Steve Niemitz <sniem...@apache.org> wrote:
> >> As some background, we handle scale up / down purely from the client
> >> side, using the update API for both directions. I'd be concerned
> >> that any scaling API powerful enough to fit all (most) use cases
> >> would just end up looking like the update API.
> >>
> >> For example, when scaling down we don't just kill the last N
> >> instances, we actually look at the least loaded hosts (globally) and
> >> kill tasks from those.
> >>
> >> On Thu, Jan 14, 2016 at 3:28 PM, Maxim Khutornenko <ma...@apache.org> wrote:
> >>
> >>> "How is scaling down different from killing instances?"
> >>>
> >>> I found the 'killTasks' syntax too different and far too powerful
> >>> to be used for scaling in. The TaskQuery allows killing instances
> >>> across jobs/roles, whereas 'scaleIn' is narrowed down to just a
> >>> single job. An additional benefit: it can be ACLed independently,
> >>> allowing an external process to kill tasks only within a given job.
> >>> We may also add rate limiting or backoff to it later.
> >>>
> >>> As for Joshua's question, I feel it should be an operator's
> >>> responsibility to diff a job against its aurora config before
> >>> applying an update. That said, if there is enough demand we can
> >>> definitely consider adding something similar to what George
> >>> suggested, or resurrecting the 'large change' warning message we
> >>> used to have in the client updater.
> >>>
> >>> On Thu, Jan 14, 2016 at 12:06 PM, George Sirois <geo...@tellapart.com>
> >>> wrote:
> >>> > As a point of reference, we solved this problem by adding a
> >>> > binding helper that queries the scheduler for the current number
> >>> > of instances and uses that number instead of a hardcoded config:
> >>> >
> >>> >     instances='{{scaling_instances[60]}}'
> >>> >
> >>> > In this example, instances will be set to the currently running
> >>> > number (unless there are none, in which case 60 instances will be
> >>> > created).
> >>> >
> >>> > On Thu, Jan 14, 2016 at 2:44 PM, Joshua Cohen <jco...@apache.org> wrote:
> >>> >
> >>> >> What happens if a job has been scaled out, but the underlying
> >>> >> config is not updated to take that scaling into account? Would
> >>> >> the next update on that job revert the number of instances
> >>> >> (presumably, because what else could we do)? Is there anything
> >>> >> we can do, tooling-wise, to improve upon this?
> >>> >>
> >>> >> On Thu, Jan 14, 2016 at 1:40 PM, Maxim Khutornenko <ma...@apache.org> wrote:
> >>> >>
> >>> >> > Our rolling update APIs can be quite inconvenient to work with
> >>> >> > when it comes to instance scaling [1]. It's especially
> >>> >> > frustrating when adding/removing instances has to be done in
> >>> >> > an automated fashion (e.g. by an external autoscaling
> >>> >> > process), as it requires holding on to the original aurora
> >>> >> > config at all times.
> >>> >> >
> >>> >> > I propose we add simple instance scaling APIs to address the
> >>> >> > above. Since an Aurora job may have instances at different
> >>> >> > configs at any given moment, I propose we accept an
> >>> >> > InstanceKey as a reference point when scaling out. For
> >>> >> > example:
> >>> >> >
> >>> >> >   /** Scales out a given job by adding more instances with the
> >>> >> >       task config of the templateKey. */
> >>> >> >   Response scaleOut(1: InstanceKey templateKey, 2: i32 incrementCount)
> >>> >> >
> >>> >> >   /** Scales in a given job by removing existing instances. */
> >>> >> >   Response scaleIn(1: JobKey job, 2: i32 decrementCount)
> >>> >> >
> >>> >> > A corresponding client command could then look like:
> >>> >> >
> >>> >> >   aurora job scale-out devcluster/vagrant/test/hello/1 10
> >>> >> >
> >>> >> > For the above command, the scheduler would take the task
> >>> >> > config of instance 1 of the 'hello' job and replicate it 10
> >>> >> > more times, thus adding 10 additional instances to the job.
> >>> >> >
> >>> >> > There are, of course, some details to work out, like making
> >>> >> > sure no active update is in flight, that scale-out does not
> >>> >> > violate quota, etc. I intend to address those during the
> >>> >> > implementation as things progress.
> >>> >> >
> >>> >> > Does the above make sense? Any concerns/suggestions?
> >>> >> >
> >>> >> > Thanks,
> >>> >> > Maxim
> >>> >> >
> >>> >> > [1] - https://issues.apache.org/jira/browse/AURORA-1258
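
For reference, a minimal sketch of how an external autoscaler might
drive the proposed APIs (hypothetical throughout: the client object, the
namedtuple stand-ins for the thrift structs, and the thresholds and step
size are illustrative, assuming only the signatures quoted above):

    from collections import namedtuple

    # Stand-ins for the generated thrift structs (hypothetical bindings;
    # the real api.thrift structs carry more fields).
    JobKey = namedtuple('JobKey', ['role', 'environment', 'name'])
    InstanceKey = namedtuple('InstanceKey', ['jobKey', 'instanceId'])

    def autoscale_tick(client, job_key, load):
        """One iteration of an external autoscaler against the proposed APIs.

        `client` is assumed to expose the proposed scaleOut/scaleIn methods;
        thresholds and step size below are placeholders.
        """
        if load > 0.8:
            # Scale out by replicating the config of an existing instance
            # (instance 0 here); no aurora config file is needed, which is
            # the simplification the proposal is after.
            client.scaleOut(InstanceKey(jobKey=job_key, instanceId=0), 2)
        elif load < 0.2:
            # Scale in within the same job only; the narrow surface
            # (vs. killTasks' TaskQuery) is what permits per-job ACLs.
            client.scaleIn(job_key, 2)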