"I'd be concerned that any scaling API to be powerful enough to fit all (most) use cases would just end up looking like the update API."
There is a big difference between the scaleOut and startJobUpdate APIs that
justifies the inclusion of the former. Namely, scaleOut may only replicate
existing instances without changing or introducing any new scheduling
requirements and without performing an instance rollout/rollback. I don't see
scaleOut ever becoming powerful enough to threaten startJobUpdate. At the same
time, dropping the aurora config requirement is a huge boost to autoscaling
client simplification (a rough sketch of what such a client could look like
follows below the quoted thread).

"For example, when scaling down we don't just kill the last N instances, we
actually look at the least loaded hosts (globally) and kill tasks from those."

I don't quite see why the same wouldn't be possible with a scaleIn API. Isn't
it always the external process's responsibility to do due diligence before
killing instances?

On Thu, Jan 14, 2016 at 12:35 PM, Steve Niemitz <sniem...@apache.org> wrote:

> As some background, we handle scale up / down purely from the client side,
> using the update API for both directions. I'd be concerned that any
> scaling API to be powerful enough to fit all (most) use cases would just
> end up looking like the update API.
>
> For example, when scaling down we don't just kill the last N instances, we
> actually look at the least loaded hosts (globally) and kill tasks from
> those.
>
> On Thu, Jan 14, 2016 at 3:28 PM, Maxim Khutornenko <ma...@apache.org> wrote:
>
>> "How is scaling down different from killing instances?"
>>
>> I found the 'killTasks' syntax too different and far too powerful to be
>> used for scaling in. The TaskQuery allows killing instances across
>> jobs/roles, whereas 'scaleIn' is narrowed down to just a single job. An
>> additional benefit: it can be ACLed independently, allowing an external
>> process to kill tasks only within a given job. We may also add rate
>> limiting or backoff to it later.
>>
>> As for Joshua's question, I feel it should be an operator's
>> responsibility to diff a job with its aurora config before applying an
>> update. That said, if there is enough demand we can definitely consider
>> adding something similar to what George suggested or resurrecting the
>> 'large change' warning message we used to have in the client updater.
>>
>> On Thu, Jan 14, 2016 at 12:06 PM, George Sirois <geo...@tellapart.com>
>> wrote:
>> > As a point of reference, we solved this problem by adding a binding
>> > helper that queries the scheduler for the current number of instances
>> > and uses that number instead of a hardcoded config:
>> >
>> >     instances='{{scaling_instances[60]}}'
>> >
>> > In this example, instances will be set to the currently running number
>> > (unless there are none, in which case 60 instances will be created).
>> >
>> > On Thu, Jan 14, 2016 at 2:44 PM, Joshua Cohen <jco...@apache.org> wrote:
>> >
>> >> What happens if a job has been scaled out, but the underlying config
>> >> is not updated to take that scaling into account? Would the next
>> >> update on that job revert the number of instances (presumably, because
>> >> what else could we do)? Is there anything we can do, tooling-wise, to
>> >> improve upon this?
>> >>
>> >> On Thu, Jan 14, 2016 at 1:40 PM, Maxim Khutornenko <ma...@apache.org>
>> >> wrote:
>> >>
>> >> > Our rolling update APIs can be quite inconvenient to work with when
>> >> > it comes to instance scaling [1]. It's especially frustrating when
>> >> > adding/removing instances has to be done in an automated fashion
>> >> > (e.g. by an external autoscaling process), as it requires holding on
>> >> > to the original aurora config at all times.
>> >> >
>> >> > I propose we add simple instance scaling APIs to address the above.
>> >> > Since an Aurora job may have instances at different configs at any
>> >> > moment, I propose we accept an InstanceKey as a reference point when
>> >> > scaling out. For example:
>> >> >
>> >> >     /** Scales out a given job by adding more instances with the
>> >> >         task config of the templateKey. */
>> >> >     Response scaleOut(1: InstanceKey templateKey, 2: i32 incrementCount)
>> >> >
>> >> >     /** Scales in a given job by removing existing instances. */
>> >> >     Response scaleIn(1: JobKey job, 2: i32 decrementCount)
>> >> >
>> >> > A corresponding client command could then look like:
>> >> >
>> >> >     aurora job scale-out devcluster/vagrant/test/hello/1 10
>> >> >
>> >> > For the above command, the scheduler would take the task config of
>> >> > instance 1 of the 'hello' job and replicate it 10 more times, thus
>> >> > adding 10 additional instances to the job.
>> >> >
>> >> > There are, of course, some details to work out, like making sure no
>> >> > active update is in flight, that scale-out does not violate quota,
>> >> > etc. I intend to address those during the implementation as things
>> >> > progress.
>> >> >
>> >> > Does the above make sense? Any concerns/suggestions?
>> >> >
>> >> > Thanks,
>> >> > Maxim
>> >> >
>> >> > [1] - https://issues.apache.org/jira/browse/AURORA-1258
>> >> >
>> >>
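
To make the "autoscaling client simplification" point above concrete, here is
a rough Python sketch of what an external autoscaler could look like against
the proposed RPCs. This is only an illustration: the SchedulerStub class, its
method names, and the JobKey/InstanceKey shapes below are placeholders modeled
on the thrift snippets quoted above, not the real generated Aurora client.

    from dataclasses import dataclass


    @dataclass(frozen=True)
    class JobKey:
        role: str
        environment: str
        name: str


    @dataclass(frozen=True)
    class InstanceKey:
        job: JobKey
        instance_id: int


    class SchedulerStub:
        """Stand-in for the thrift-generated scheduler client (hypothetical)."""

        def scaleOut(self, template_key: InstanceKey, increment_count: int) -> None:
            print(f"scaleOut({template_key}, {increment_count})")

        def scaleIn(self, job: JobKey, decrement_count: int) -> None:
            print(f"scaleIn({job}, {decrement_count})")


    def reconcile(client: SchedulerStub, job: JobKey, running: int, desired: int) -> None:
        """Converge the instance count toward `desired` using only live scheduler state."""
        if desired > running:
            # Replicate the config of an instance that is already running
            # (instance 0 here); no .aurora file is needed on the autoscaler side.
            client.scaleOut(InstanceKey(job, 0), desired - running)
        elif desired < running:
            client.scaleIn(job, running - desired)


    if __name__ == "__main__":
        job = JobKey(role="vagrant", environment="test", name="hello")
        reconcile(SchedulerStub(), job, running=5, desired=8)  # scales out by 3
        reconcile(SchedulerStub(), job, running=8, desired=6)  # scales in by 2

Doing the same reconciliation through startJobUpdate would force the
autoscaler to carry a full TaskConfig (effectively the original aurora config)
just to change the instance count, which is the friction [1] describes.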