The API change is basically what I had imagined, +1. I had pictured a
scale-down command as a wrapper around job kill, though another approach
would be to augment the current job kill command to kill the last N
instances of a job, e.g.:

    aurora job kill devcluster/vagrant/test/hello 10
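A minimal sketch of what such a client-side wrapper could look like
(purely illustrative: the kill_last_n helper is hypothetical, and it
assumes the client's job kill accepts an explicit instance list appended
to the job path, e.g. .../8,9):

    import subprocess

    def kill_last_n(job_path, active_instance_ids, n):
        """Kill the last n instances of an Aurora job (illustrative wrapper).

        job_path: cluster/role/env/name, e.g. 'devcluster/vagrant/test/hello'.
        active_instance_ids: the job's current instance IDs; the caller is
        expected to fetch these (e.g. from `aurora job status`), since IDs
        may be sparse after earlier kills.
        """
        victims = sorted(active_instance_ids)[-n:]
        # Append the explicit instance list to the job path, e.g.
        # devcluster/vagrant/test/hello/8,9, and delegate to job kill.
        spec = '%s/%s' % (job_path, ','.join(str(i) for i in victims))
        subprocess.check_call(['aurora', 'job', 'kill', spec])

    # e.g.: kill_last_n('devcluster/vagrant/test/hello', range(10), 2)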
On Fri, Jan 15, 2016 at 10:06 AM, Maxim Khutornenko <ma...@apache.org> wrote:
> I wasn't planning on using the rolling updater functionality given the
> simplicity of the operation. I'd second Steve's earlier concerns about
> scaleOut() looking more like startJobUpdate() if we keep adding
> features. If health watching, throttling (batch_size) or rollback on
> failure is required, then I believe startJobUpdate() should be used
> instead of scaleOut().
>
> On Fri, Jan 15, 2016 at 1:09 AM, Erb, Stephan
> <stephan....@blue-yonder.com> wrote:
> > I really like the proposal. The gain in simplicity on the client side
> > from not having to provide an aurora config is quite significant.
> >
> > The implementation on the scheduler side is probably rather
> > straightforward, as the updater can be reused. That would also provide
> > us with the update UI, which has proven quite useful when tracing
> > autoscaler events.
> >
> > Regards,
> > Stephan
> > ________________________________________
> > From: Maxim Khutornenko <ma...@apache.org>
> > Sent: Thursday, January 14, 2016 9:50 PM
> > To: dev@aurora.apache.org
> > Subject: Re: [PROPOSAL] Job instance scaling APIs
> >
> > "I'd be concerned that any scaling API powerful enough to fit all
> > (most) use cases would just end up looking like the update API."
> >
> > There is a big difference between the scaleOut and startJobUpdate APIs
> > that justifies the inclusion of the former. Namely, scaleOut may only
> > replicate existing instances, without changing or introducing any new
> > scheduling requirements and without performing instance
> > rollout/rollback. I don't see scaleOut ever becoming powerful enough
> > to threaten startJobUpdate. At the same time, dropping the aurora
> > config requirement is a huge boost to autoscaling client simplicity.
> >
> > "For example, when scaling down we don't just kill the last N
> > instances, we actually look at the least loaded hosts (globally) and
> > kill tasks from those."
> >
> > I don't quite see why the same wouldn't be possible with a scaleIn
> > API. Isn't it always the external process's responsibility to pay due
> > diligence before killing instances?
> >
> > On Thu, Jan 14, 2016 at 12:35 PM, Steve Niemitz <sniem...@apache.org> wrote:
> >> As some background, we handle scale up / down purely from the client
> >> side, using the update API for both directions. I'd be concerned
> >> that any scaling API powerful enough to fit all (most) use cases
> >> would just end up looking like the update API.
> >>
> >> For example, when scaling down we don't just kill the last N
> >> instances, we actually look at the least loaded hosts (globally) and
> >> kill tasks from those.
> >>
> >> On Thu, Jan 14, 2016 at 3:28 PM, Maxim Khutornenko <ma...@apache.org> wrote:
> >>
> >>> "How is scaling down different from killing instances?"
> >>>
> >>> I found the 'killTasks' syntax too different and far too powerful
> >>> to be used for scaling in. The TaskQuery allows killing instances
> >>> across jobs/roles, whereas 'scaleIn' is narrowed down to just a
> >>> single job. An additional benefit: it can be ACLed independently,
> >>> allowing an external process to kill tasks only within a given job.
> >>> We may also add rate limiting or backoff to it later.
> >>>
> >>> As for Joshua's question, I feel it should be an operator's
> >>> responsibility to diff a job against its aurora config before
> >>> applying an update. That said, if there is enough demand we can
> >>> definitely consider adding something similar to what George
> >>> suggested, or resurrecting the 'large change' warning message we
> >>> used to have in the client updater.
> >>>
> >>> On Thu, Jan 14, 2016 at 12:06 PM, George Sirois <geo...@tellapart.com>
> >>> wrote:
> >>> > As a point of reference, we solved this problem by adding a
> >>> > binding helper that queries the scheduler for the current number
> >>> > of instances and uses that number instead of a hardcoded config:
> >>> >
> >>> >     instances='{{scaling_instances[60]}}'
> >>> >
> >>> > In this example, instances will be set to the currently running
> >>> > number (unless there are none, in which case 60 instances will be
> >>> > created).
> >>> >
> >>> > On Thu, Jan 14, 2016 at 2:44 PM, Joshua Cohen <jco...@apache.org> wrote:
> >>> >
> >>> >> What happens if a job has been scaled out, but the underlying
> >>> >> config is not updated to take that scaling into account? Would
> >>> >> the next update on that job revert the number of instances
> >>> >> (presumably, because what else could we do)? Is there anything
> >>> >> we can do, tooling-wise, to improve upon this?
> >>> >>
> >>> >> On Thu, Jan 14, 2016 at 1:40 PM, Maxim Khutornenko <ma...@apache.org> wrote:
> >>> >>
> >>> >> > Our rolling update APIs can be quite inconvenient to work with
> >>> >> > when it comes to instance scaling [1]. It's especially
> >>> >> > frustrating when adding/removing instances has to be done in
> >>> >> > an automated fashion (e.g. by an external autoscaling
> >>> >> > process), as it requires holding on to the original aurora
> >>> >> > config at all times.
> >>> >> >
> >>> >> > I propose we add simple instance scaling APIs to address the
> >>> >> > above. Since an Aurora job may have instances at different
> >>> >> > configs at any given moment, I propose we accept an
> >>> >> > InstanceKey as a reference point when scaling out. For
> >>> >> > example:
> >>> >> >
> >>> >> >   /** Scales out a given job by adding more instances with the
> >>> >> >       task config of the templateKey. */
> >>> >> >   Response scaleOut(1: InstanceKey templateKey, 2: i32 incrementCount)
> >>> >> >
> >>> >> >   /** Scales in a given job by removing existing instances. */
> >>> >> >   Response scaleIn(1: JobKey job, 2: i32 decrementCount)
> >>> >> >
> >>> >> > A corresponding client command could then look like:
> >>> >> >
> >>> >> >   aurora job scale-out devcluster/vagrant/test/hello/1 10
> >>> >> >
> >>> >> > For the above command, the scheduler would take the task
> >>> >> > config of instance 1 of the 'hello' job and replicate it 10
> >>> >> > more times, thus adding 10 additional instances to the job.
> >>> >> >
> >>> >> > There are, of course, some details to work out, like making
> >>> >> > sure no active update is in flight, that scale-out does not
> >>> >> > violate quota, etc. I intend to address those during the
> >>> >> > implementation as things progress.
> >>> >> >
> >>> >> > Does the above make sense? Any concerns/suggestions?
> >>> >> >
> >>> >> > Thanks,
> >>> >> > Maxim
> >>> >> >
> >>> >> > [1] - https://issues.apache.org/jira/browse/AURORA-1258
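
For reference, a minimal sketch of how an external autoscaler might
drive the proposed APIs (hypothetical throughout: the client object, the
namedtuple stand-ins for the thrift structs, and the thresholds and step
size are illustrative, assuming only the signatures quoted above):

    from collections import namedtuple

    # Stand-ins for the generated thrift structs (hypothetical bindings;
    # the real api.thrift structs carry more fields).
    JobKey = namedtuple('JobKey', ['role', 'environment', 'name'])
    InstanceKey = namedtuple('InstanceKey', ['jobKey', 'instanceId'])

    def autoscale_tick(client, job_key, load):
        """One iteration of an external autoscaler against the proposed APIs.

        `client` is assumed to expose the proposed scaleOut/scaleIn methods;
        thresholds and step size below are placeholders.
        """
        if load > 0.8:
            # Scale out by replicating the config of an existing instance
            # (instance 0 here); no aurora config file is needed, which is
            # the simplification the proposal is after.
            client.scaleOut(InstanceKey(jobKey=job_key, instanceId=0), 2)
        elif load < 0.2:
            # Scale in within the same job only; the narrow surface
            # (vs. killTasks' TaskQuery) is what permits per-job ACLs.
            client.scaleIn(job_key, 2)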