+1 to addInstances

On Tue, Jan 19, 2016 at 3:00 PM, Bill Farner <wfar...@apache.org> wrote:
> At risk of devolving the discussion, is it worth calling the method
> addInstances as opposed to scaleOut? I find the former more descriptive.
>
> On Tue, Jan 19, 2016 at 11:12 AM, Maxim Khutornenko <ma...@apache.org>
> wrote:
>
> > "Of course, the scaler could manually health check that all instances
> > have come up and are being used as expected, but I guess that is what
> > Aurora is for."
> >
> > I'd argue the updater "watch_secs" health checking isn't enough to
> > ensure graceful rollout as instances may start flapping right after
> > the updater signs off. Instances outside of the update window may also
> > flap (e.g. due to backend pressure) and the updater will not be able to
> > catch that. That's why a robust autoscaler has to rely on external
> > monitoring tools and overall job health instead.
> >
> > A very basic approach, as you mentioned above, could be querying job
> > status repeatedly and counting the ratio of tasks in RUNNING vs active
> > (ASSIGNED, PENDING, THROTTLED, STARTING, etc.) states in order to make
> > a scaleOut decision. The more reliable approach though would also rely
> > on external monitoring stats exposed by user processes. That would be
> > a much higher fidelity signal than a decision based on task status
> > alone. Scheduler does not (and should not for scalability reasons)
> > have visibility into those stats, so the autoscaler would be in a much
> > better position to make an executive decision there.
> >
> > On Sun, Jan 17, 2016 at 9:00 AM, Erb, Stephan
> > <stephan....@blue-yonder.com> wrote:
> > > I believe the operation is not that simple when you look at the
> > > end-to-end scenario.
> > >
> > > For example, the implementation of an auto-scaler using the new
> > > scaleOut() API could look like:
> > >
> > > 1) check some KPI
> > > 2) Infer an action based on this KPI such as scaleUp() or scaleDown()
> > > 3) wait until the effects of the adjusted instance count are
> > > reflected in the KPI. Go to 1 and repeat.
> > >
> > > The health checking capabilities of the existing updater (in
> > > particular together with [1]) would be really helpful here. Still,
> > > the simplified scaleOut() API would offer the great benefit that the
> > > auto-scaler would not need to know about the used aurora
> > > configuration.
> > >
> > > We even had an incident with a sub-optimal implementation of step 3):
> > > An overloaded package backend led to slow service startups. The
> > > service startup took longer than the grace-period of our auto-scaler.
> > > It therefore decided to add more and more instances, because the KPI
> > > wasn't improving as expected. It had no way of knowing that these
> > > instances were not even 'running'. The additionally added instances
> > > aggravated the overload situation of the package backend. Of course,
> > > the scaler could manually health check that all instances have come
> > > up and are being used as expected, but I guess that is what Aurora is
> > > for.
> > >
> > > [1]
> > > https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit?pref=2&pli=1#heading=h.n0kb37aiy8ua
> > >
> > > Best Regards,
> > > Stephan
> > > ________________________________________
> > > From: Maxim Khutornenko <ma...@apache.org>
> > > Sent: Friday, January 15, 2016 7:06 PM
> > > To: dev@aurora.apache.org
> > > Subject: Re: [PROPOSAL] Job instance scaling APIs
> > >
> > > I wasn't planning on using the rolling updater functionality given the
> > > simplicity of the operation. I'd second Steve's earlier concerns about
> > > scaleOut() looking more like startJobUpdate() if we keep adding
> > > features. If health watching, throttling (batch_size) or rollback on
> > > failure is required then I believe the startJobUpdate() should be used
> > > instead of scaleOut().
> > >
> > > On Fri, Jan 15, 2016 at 1:09 AM, Erb, Stephan
> > > <stephan....@blue-yonder.com> wrote:
> > >> I really like the proposal.
> > >> The gain in simplicity on the client-side by not having to provide
> > >> an aurora config is quite significant.
> > >>
> > >> The implementation on the scheduler side is probably rather
> > >> straightforward as the update can be reused. That would also provide
> > >> us with the update UI, which has proven quite useful when tracing
> > >> autoscaler events.
> > >>
> > >> Regards,
> > >> Stephan
> > >> ________________________________________
> > >> From: Maxim Khutornenko <ma...@apache.org>
> > >> Sent: Thursday, January 14, 2016 9:50 PM
> > >> To: dev@aurora.apache.org
> > >> Subject: Re: [PROPOSAL] Job instance scaling APIs
> > >>
> > >> "I'd be concerned that any
> > >> scaling API to be powerful enough to fit all (most) use cases would
> > >> just end up looking like the update API."
> > >>
> > >> There is a big difference between scaleOut and startJobUpdate APIs
> > >> that justifies the inclusion of the former. Namely, scaleOut may only
> > >> replicate the existing instances without changing/introducing any new
> > >> scheduling requirements or performing instance rollout/rollback. I
> > >> don't see scaleOut ever becoming powerful enough to threaten
> > >> startJobUpdate. At the same time, the absence of aurora config
> > >> requirement is a huge boost to autoscaling client simplification.
> > >>
> > >> "For example, when scaling down we don't just kill the last N
> > >> instances, we actually look at the least loaded hosts (globally) and
> > >> kill tasks from those."
> > >>
> > >> I don't quite see why the same wouldn't be possible with a scaleIn
> > >> API. Isn't it always the external process's responsibility to do due
> > >> diligence before killing instances?
> > >>
> > >>
> > >> On Thu, Jan 14, 2016 at 12:35 PM, Steve Niemitz <sniem...@apache.org>
> > >> wrote:
> > >>> As some background, we handle scale up / down purely from the
> > >>> client side, using the update API for both directions.
> > >>> I'd be concerned that any scaling API to be powerful enough to fit
> > >>> all (most) use cases would just end up looking like the update API.
> > >>>
> > >>> For example, when scaling down we don't just kill the last N
> > >>> instances, we actually look at the least loaded hosts (globally) and
> > >>> kill tasks from those.
> > >>>
> > >>>
> > >>> On Thu, Jan 14, 2016 at 3:28 PM, Maxim Khutornenko <ma...@apache.org>
> > >>> wrote:
> > >>>
> > >>>> "How is scaling down different from killing instances?"
> > >>>>
> > >>>> I found 'killTasks' syntax too different and far more powerful than
> > >>>> needed for scaling in. The TaskQuery allows killing instances across
> > >>>> jobs/roles, whereas 'scaleIn' is narrowed down to just a single job.
> > >>>> Additional benefit: it can be ACLed independently by allowing an
> > >>>> external process to kill tasks only within a given job. We may also
> > >>>> add rate limiting or backoff to it later.
> > >>>>
> > >>>> As for Joshua's question, I feel it should be an operator's
> > >>>> responsibility to diff a job with its aurora config before applying
> > >>>> an update. That said, if there is enough demand we can definitely
> > >>>> consider adding something similar to what George suggested or
> > >>>> resurrecting a 'large change' warning message we used to have in
> > >>>> client updater.
> > >>>>
> > >>>> On Thu, Jan 14, 2016 at 12:06 PM, George Sirois <geo...@tellapart.com>
> > >>>> wrote:
> > >>>> > As a point of reference, we solved this problem by adding a
> > >>>> > binding helper that queries the scheduler for the current number
> > >>>> > of instances and uses that number instead of a hardcoded config:
> > >>>> >
> > >>>> >     instances='{{scaling_instances[60]}}'
> > >>>> >
> > >>>> > In this example, instances will be set to the currently running
> > >>>> > number (unless there are none, in which case 60 instances will be
> > >>>> > created).
> > >>>> >
> > >>>> > On Thu, Jan 14, 2016 at 2:44 PM, Joshua Cohen <jco...@apache.org>
> > >>>> > wrote:
> > >>>> >
> > >>>> >> What happens if a job has been scaled out, but the underlying
> > >>>> >> config is not updated to take that scaling into account? Would
> > >>>> >> the next update on that job revert the number of instances
> > >>>> >> (presumably, because what else could we do)? Is there anything we
> > >>>> >> can do, tooling-wise, to improve upon this?
> > >>>> >>
> > >>>> >> On Thu, Jan 14, 2016 at 1:40 PM, Maxim Khutornenko <ma...@apache.org>
> > >>>> >> wrote:
> > >>>> >>
> > >>>> >> > Our rolling update APIs can be quite inconvenient to work with
> > >>>> >> > when it comes to instance scaling [1]. It's especially
> > >>>> >> > frustrating when adding/removing instances has to be done in an
> > >>>> >> > automated fashion (e.g.: by an external autoscaling process) as
> > >>>> >> > it requires holding on to the original aurora config at all
> > >>>> >> > times.
> > >>>> >> >
> > >>>> >> > I propose we add simple instance scaling APIs to address the
> > >>>> >> > above. Since an Aurora job may have instances at different
> > >>>> >> > configs at any moment, I propose we accept an InstanceKey as a
> > >>>> >> > reference point when scaling out. For example:
> > >>>> >> >
> > >>>> >> >     /** Scales out a given job by adding more instances with the
> > >>>> >> >         task config of the templateKey. */
> > >>>> >> >     Response scaleOut(1: InstanceKey templateKey, 2: i32 incrementCount)
> > >>>> >> >
> > >>>> >> >     /** Scales in a given job by removing existing instances. */
> > >>>> >> >     Response scaleIn(1: JobKey job, 2: i32 decrementCount)
> > >>>> >> >
> > >>>> >> > A corresponding client command could then look like:
> > >>>> >> >
> > >>>> >> >     aurora job scale-out devcluster/vagrant/test/hello/1 10
> > >>>> >> >
> > >>>> >> > For the above command, a scheduler would take the task config of
> > >>>> >> > instance 1 of the 'hello' job and replicate it 10 more times,
> > >>>> >> > thus adding 10 additional instances to the job.
> > >>>> >> >
> > >>>> >> > There are, of course, some details to work out like making sure
> > >>>> >> > no active update is in flight, scale out does not violate quota,
> > >>>> >> > etc. I intend to address those during the implementation as
> > >>>> >> > things progress.
> > >>>> >> >
> > >>>> >> > Does the above make sense? Any concerns/suggestions?
> > >>>> >> >
> > >>>> >> > Thanks,
> > >>>> >> > Maxim
> > >>>> >> >
> > >>>> >> > [1] - https://issues.apache.org/jira/browse/AURORA-1258
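
For readers following the thread, here is a minimal sketch of the "very basic approach" Maxim describes: compare the number of RUNNING tasks against all active tasks before letting an autoscaler add more instances. This is an illustration only, not Aurora code. The function names (running_ratio, should_scale_out), the min_running_ratio parameter, and the idea of passing in a plain list of state strings are assumptions made for the example; how the states are fetched from the scheduler is left out. The state names are the ones listed in the thread, and since that list ends in "etc.", the set below is probably incomplete.

```python
from collections import Counter
from typing import Iterable

# Active-but-not-yet-running states named in the thread. The thread's list
# ends with "etc.", so treat this set as illustrative, not exhaustive.
ACTIVE_NOT_RUNNING = {"ASSIGNED", "PENDING", "THROTTLED", "STARTING"}
RUNNING = "RUNNING"


def running_ratio(states: Iterable[str]) -> float:
    """Return the fraction of active tasks that are RUNNING.

    States outside RUNNING and ACTIVE_NOT_RUNNING (for example terminal
    ones) are ignored. Returns 0.0 when no task is active at all.
    """
    counts = Counter(states)
    running = counts[RUNNING]
    active = running + sum(counts[s] for s in ACTIVE_NOT_RUNNING)
    return running / active if active else 0.0


def should_scale_out(
    states: Iterable[str],
    kpi_overloaded: bool,
    min_running_ratio: float = 1.0,
) -> bool:
    """Hypothetical gate: scale out only when the KPI asks for more capacity
    AND enough of the previously requested instances are actually RUNNING.
    """
    return kpi_overloaded and running_ratio(states) >= min_running_ratio
```

With a gate like this, the incident Stephan describes (slow startups causing the scaler to pile on more instances) would be held back while earlier instances are still PENDING or STARTING. As Maxim notes, task status alone is a low-fidelity signal, so a real autoscaler would likely combine it with external monitoring stats.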