At the risk of devolving the discussion, is it worth calling the method addInstances as opposed to scaleOut? I find the former more descriptive.
On Tue, Jan 19, 2016 at 11:12 AM, Maxim Khutornenko <ma...@apache.org> wrote:
> "Of course, the scaler could manually health check that all instances have come up and are being used as expected, but I guess that is what Aurora is for."
>
> I'd argue the updater "watch_secs" health checking isn't enough to ensure a graceful rollout, as instances may start flapping right after the updater signs off. Instances outside of the update window may also flap (e.g. due to backend pressure) and the updater will not be able to catch that. That's why a robust autoscaler has to rely on external monitoring tools and overall job health instead.
>
> A very basic approach, as you mentioned above, could be querying job status repeatedly and counting the ratio of tasks in RUNNING vs. active (ASSIGNED, PENDING, THROTTLED, STARTING, etc.) states in order to make a scaleOut decision. The more reliable approach, though, would also rely on external monitoring stats exposed by user processes. That would be a much higher fidelity signal than a decision based on task status alone. The scheduler does not (and should not, for scalability reasons) have visibility into those stats, so the autoscaler would be in a much better position to make an executive decision there.
>
> On Sun, Jan 17, 2016 at 9:00 AM, Erb, Stephan <stephan....@blue-yonder.com> wrote:
> > I believe the operation is not that simple when you look at the end-to-end scenario.
> >
> > For example, the implementation of an auto-scaler using the new scaleOut() API could look like:
> >
> > 1) Check some KPI.
> > 2) Infer an action based on this KPI, such as scaleUp() or scaleDown().
> > 3) Wait until the effect of the adjusted instance count is reflected in the KPI. Go to 1 and repeat.
> >
> > The health checking capabilities of the existing updater (in particular together with [1]) would be really helpful here. Still, the simplified scaleOut() API would offer the great benefit that the auto-scaler would not need to know about the aurora configuration in use.
> >
> > We even had an incident with a sub-optimal implementation of step 3): an overloaded package backend led to slow service startups. The service startup took longer than the grace period of our auto-scaler. It therefore decided to add more and more instances, because the KPI wasn't improving as expected. It had no way of knowing that these instances were not even 'running'. The additionally added instances aggravated the overload situation of the package backend. Of course, the scaler could manually health check that all instances have come up and are being used as expected, but I guess that is what Aurora is for.
> >
> > [1] https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit?pref=2&pli=1#heading=h.n0kb37aiy8ua
> >
> > Best Regards,
> > Stephan
> > ________________________________________
> > From: Maxim Khutornenko <ma...@apache.org>
> > Sent: Friday, January 15, 2016 7:06 PM
> > To: dev@aurora.apache.org
> > Subject: Re: [PROPOSAL] Job instance scaling APIs
> >
> > I wasn't planning on using the rolling updater functionality, given the simplicity of the operation. I'd second Steve's earlier concerns about scaleOut() looking more like startJobUpdate() if we keep adding features. If health watching, throttling (batch_size) or rollback on failure is required, then I believe startJobUpdate() should be used instead of scaleOut().
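For illustration, the "ratio of RUNNING vs. active tasks" check Maxim describes above could look roughly like the sketch below. It is only a sketch against an imaginary task listing: the Task tuple and the running_ratio()/should_scale_out() helpers are made-up names, not part of any existing Aurora client API.

    from collections import namedtuple

    # Hypothetical, minimal view of a task as returned by a job status query.
    Task = namedtuple('Task', ['instance_id', 'status'])

    ACTIVE_STATES = {'ASSIGNED', 'PENDING', 'THROTTLED', 'STARTING', 'RUNNING'}

    def running_ratio(tasks):
        """Fraction of active tasks that have actually reached RUNNING."""
        active = [t for t in tasks if t.status in ACTIVE_STATES]
        if not active:
            return 0.0
        return sum(1 for t in active if t.status == 'RUNNING') / float(len(active))

    def should_scale_out(tasks, kpi_breached, healthy_ratio=0.9):
        """Only scale out when the KPI demands it AND the job looks healthy,
        so we don't pile more instances onto a job that is still starting up."""
        return kpi_breached and running_ratio(tasks) >= healthy_ratio

With only one of two active instances RUNNING, should_scale_out([Task(0, 'RUNNING'), Task(1, 'STARTING')], kpi_breached=True) returns False, which is exactly the kind of guard that would have prevented the incident Stephan describes above.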
> >
> > On Fri, Jan 15, 2016 at 1:09 AM, Erb, Stephan <stephan....@blue-yonder.com> wrote:
> >> I really like the proposal. The gain in simplicity on the client side by not having to provide an aurora config is quite significant.
> >>
> >> The implementation on the scheduler side is probably rather straightforward, as the updater can be reused. That would also provide us with the update UI, which has shown to be quite useful when tracing autoscaler events.
> >>
> >> Regards,
> >> Stephan
> >> ________________________________________
> >> From: Maxim Khutornenko <ma...@apache.org>
> >> Sent: Thursday, January 14, 2016 9:50 PM
> >> To: dev@aurora.apache.org
> >> Subject: Re: [PROPOSAL] Job instance scaling APIs
> >>
> >> "I'd be concerned that any scaling API powerful enough to fit all (most) use cases would just end up looking like the update API."
> >>
> >> There is a big difference between the scaleOut and startJobUpdate APIs that justifies the inclusion of the former. Namely, scaleOut may only replicate the existing instances without changing/introducing any new scheduling requirements or performing instance rollout/rollback. I don't see scaleOut ever becoming powerful enough to threaten startJobUpdate. At the same time, the absence of an aurora config requirement is a huge boost to autoscaling client simplification.
> >>
> >> "For example, when scaling down we don't just kill the last N instances, we actually look at the least loaded hosts (globally) and kill tasks from those."
> >>
> >> I don't quite see why the same wouldn't be possible with a scaleIn API. Isn't it always the external process's responsibility to pay due diligence before killing instances?
> >>
> >> On Thu, Jan 14, 2016 at 12:35 PM, Steve Niemitz <sniem...@apache.org> wrote:
> >>> As some background, we handle scale up/down purely from the client side, using the update API for both directions. I'd be concerned that any scaling API powerful enough to fit all (most) use cases would just end up looking like the update API.
> >>>
> >>> For example, when scaling down we don't just kill the last N instances, we actually look at the least loaded hosts (globally) and kill tasks from those.
> >>>
> >>> On Thu, Jan 14, 2016 at 3:28 PM, Maxim Khutornenko <ma...@apache.org> wrote:
> >>>> "How is scaling down different from killing instances?"
> >>>>
> >>>> I found the 'killTasks' syntax too different and far too powerful to be used for scaling in. The TaskQuery allows killing instances across jobs/roles, whereas 'scaleIn' is narrowed down to just a single job. An additional benefit: it can be ACLed independently, allowing an external process to kill tasks only within a given job. We may also add rate limiting or backoff to it later.
> >>>>
> >>>> As for Joshua's question, I feel it should be an operator's responsibility to diff a job with its aurora config before applying an update. That said, if there is enough demand we can definitely consider adding something similar to what George suggested, or resurrecting the 'large change' warning message we used to have in the client updater.
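A rough client-side sketch of the "least loaded hosts" scale-in selection Steve describes above is shown below. The helper name, the load figures, and the data sources are all hypothetical; the (instance_id, hostname) pairs would come from the scheduler and the load map from an external monitoring system.

    def pick_scale_in_victims(instances, host_load, count):
        """instances: list of (instance_id, hostname) pairs.
        host_load: {hostname: load}, e.g. CPU utilization from monitoring.
        Returns the instance ids to remove, least loaded hosts first."""
        ranked = sorted(instances, key=lambda inst: host_load.get(inst[1], 0.0))
        return [instance_id for instance_id, _ in ranked[:count]]

    # Example: host B is far busier than host A, so the instances on A go first.
    victims = pick_scale_in_victims(
        instances=[(0, 'hostA'), (1, 'hostB'), (2, 'hostA')],
        host_load={'hostA': 0.2, 'hostB': 0.9},
        count=2)  # -> [0, 2]

Whether such a victim list would then be applied via killTasks or could be expressed through a scaleIn-style call is exactly the trade-off being debated above.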
> >>>>
> >>>> On Thu, Jan 14, 2016 at 12:06 PM, George Sirois <geo...@tellapart.com> wrote:
> >>>> > As a point of reference, we solved this problem by adding a binding helper that queries the scheduler for the current number of instances and uses that number instead of a hardcoded config:
> >>>> >
> >>>> >   instances='{{scaling_instances[60]}}'
> >>>> >
> >>>> > In this example, instances will be set to the currently running number (unless there are none, in which case 60 instances will be created).
> >>>> >
> >>>> > On Thu, Jan 14, 2016 at 2:44 PM, Joshua Cohen <jco...@apache.org> wrote:
> >>>> >
> >>>> >> What happens if a job has been scaled out, but the underlying config is not updated to take that scaling into account? Would the next update on that job revert the number of instances (presumably, because what else could we do)? Is there anything we can do, tooling-wise, to improve upon this?
> >>>> >>
> >>>> >> On Thu, Jan 14, 2016 at 1:40 PM, Maxim Khutornenko <ma...@apache.org> wrote:
> >>>> >>
> >>>> >> > Our rolling update APIs can be quite inconvenient to work with when it comes to instance scaling [1]. It's especially frustrating when adding/removing instances has to be done in an automated fashion (e.g. by an external autoscaling process), as it requires holding on to the original aurora config at all times.
> >>>> >> >
> >>>> >> > I propose we add simple instance scaling APIs to address the above. Since an Aurora job may have instances at different configs at any moment, I propose we accept an InstanceKey as a reference point when scaling out. For example:
> >>>> >> >
> >>>> >> >   /** Scales out a given job by adding more instances with the task config of the templateKey. */
> >>>> >> >   Response scaleOut(1: InstanceKey templateKey, 2: i32 incrementCount)
> >>>> >> >
> >>>> >> >   /** Scales in a given job by removing existing instances. */
> >>>> >> >   Response scaleIn(1: JobKey job, 2: i32 decrementCount)
> >>>> >> >
> >>>> >> > A corresponding client command could then look like:
> >>>> >> >
> >>>> >> >   aurora job scale-out devcluster/vagrant/test/hello/1 10
> >>>> >> >
> >>>> >> > For the above command, the scheduler would take the task config of instance 1 of the 'hello' job and replicate it 10 more times, thus adding 10 additional instances to the job.
> >>>> >> >
> >>>> >> > There are, of course, some details to work out, like making sure no active update is in flight, that scale-out does not violate quota, etc. I intend to address those during the implementation as things progress.
> >>>> >> >
> >>>> >> > Does the above make sense? Any concerns/suggestions?
> >>>> >> >
> >>>> >> > Thanks,
> >>>> >> > Maxim
> >>>> >> >
> >>>> >> > [1] - https://issues.apache.org/jira/browse/AURORA-1258
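To tie the thread together, an external autoscaler built on the proposed commands could be as small as the sketch below. Note that 'aurora job scale-out' and 'scale-in' exist only as a proposal in this thread, and the KPI and health checks are left as stubs the caller must supply.

    import subprocess
    import time

    JOB = 'devcluster/vagrant/test/hello'

    def scale_out(template_instance, count):
        # Per the proposed scaleOut(InstanceKey, i32) semantics: instance
        # <template_instance> of the job serves as the config template.
        subprocess.check_call(
            ['aurora', 'job', 'scale-out', '%s/%d' % (JOB, template_instance), str(count)])

    def scale_in(count):
        subprocess.check_call(['aurora', 'job', 'scale-in', JOB, str(count)])

    def autoscale_loop(check_kpi, job_is_healthy, step=10, period=60):
        """Stephan's loop from above: check a KPI, act, wait, repeat --
        but only act while the job itself looks healthy, to avoid piling
        instances onto a job whose startups are already backed up."""
        while True:
            if job_is_healthy():
                kpi = check_kpi()
                if kpi == 'overloaded':
                    scale_out(template_instance=0, count=step)
                elif kpi == 'underloaded':
                    scale_in(count=step)
            time.sleep(period)

Because the loop never needs to render or even read an aurora config, it sidesteps the need to hold on to the original config; it does not, however, solve the stale-instance-count problem Joshua raises, so the next full update would still need tooling (or a binding helper like George's) to reconcile the counts.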