"I'd be concerned that any scaling API to be powerful enough to fit all (most) use cases would just end up looking like the update API."
There is a big difference between the scaleOut and startJobUpdate APIs that
justifies the inclusion of the former. Namely, scaleOut may only replicate
existing instances without changing or introducing any new scheduling
requirements and without performing an instance rollout/rollback. I don't see
scaleOut ever becoming powerful enough to threaten startJobUpdate. At the same
time, dropping the aurora config requirement is a huge boost to autoscaling
client simplification (a rough sketch of what such a client could look like
follows below the quoted thread).

"For example, when scaling down we don't just kill the last N instances, we
actually look at the least loaded hosts (globally) and kill tasks from those."

I don't quite see why the same wouldn't be possible with a scaleIn API. Isn't
it always the external process's responsibility to do due diligence before
killing instances?

On Thu, Jan 14, 2016 at 12:35 PM, Steve Niemitz <sniem...@apache.org> wrote:

> As some background, we handle scale up / down purely from the client side,
> using the update API for both directions. I'd be concerned that any
> scaling API to be powerful enough to fit all (most) use cases would just
> end up looking like the update API.
>
> For example, when scaling down we don't just kill the last N instances, we
> actually look at the least loaded hosts (globally) and kill tasks from
> those.
>
> On Thu, Jan 14, 2016 at 3:28 PM, Maxim Khutornenko <ma...@apache.org> wrote:
>
>> "How is scaling down different from killing instances?"
>>
>> I found the 'killTasks' syntax too different and far too powerful to be
>> used for scaling in. The TaskQuery allows killing instances across
>> jobs/roles, whereas 'scaleIn' is narrowed down to just a single job. An
>> additional benefit: it can be ACLed independently, allowing an external
>> process to kill tasks only within a given job. We may also add rate
>> limiting or backoff to it later.
>>
>> As for Joshua's question, I feel it should be an operator's
>> responsibility to diff a job with its aurora config before applying an
>> update. That said, if there is enough demand we can definitely consider
>> adding something similar to what George suggested or resurrecting the
>> 'large change' warning message we used to have in the client updater.
>>
>> On Thu, Jan 14, 2016 at 12:06 PM, George Sirois <geo...@tellapart.com>
>> wrote:
>> > As a point of reference, we solved this problem by adding a binding
>> > helper that queries the scheduler for the current number of instances
>> > and uses that number instead of a hardcoded config:
>> >
>> >     instances='{{scaling_instances[60]}}'
>> >
>> > In this example, instances will be set to the currently running number
>> > (unless there are none, in which case 60 instances will be created).
>> >
>> > On Thu, Jan 14, 2016 at 2:44 PM, Joshua Cohen <jco...@apache.org> wrote:
>> >
>> >> What happens if a job has been scaled out, but the underlying config
>> >> is not updated to take that scaling into account? Would the next
>> >> update on that job revert the number of instances (presumably, because
>> >> what else could we do)? Is there anything we can do, tooling-wise, to
>> >> improve upon this?
>> >>
>> >> On Thu, Jan 14, 2016 at 1:40 PM, Maxim Khutornenko <ma...@apache.org>
>> >> wrote:
>> >>
>> >> > Our rolling update APIs can be quite inconvenient to work with when
>> >> > it comes to instance scaling [1]. It's especially frustrating when
>> >> > adding/removing instances has to be done in an automated fashion
>> >> > (e.g. by an external autoscaling process), as it requires holding on
>> >> > to the original aurora config at all times.
>> >> >
>> >> > I propose we add simple instance scaling APIs to address the above.
>> >> > Since an Aurora job may have instances at different configs at any
>> >> > moment, I propose we accept an InstanceKey as a reference point when
>> >> > scaling out. For example:
>> >> >
>> >> >     /** Scales out a given job by adding more instances with the
>> >> >         task config of the templateKey. */
>> >> >     Response scaleOut(1: InstanceKey templateKey, 2: i32 incrementCount)
>> >> >
>> >> >     /** Scales in a given job by removing existing instances. */
>> >> >     Response scaleIn(1: JobKey job, 2: i32 decrementCount)
>> >> >
>> >> > A corresponding client command could then look like:
>> >> >
>> >> >     aurora job scale-out devcluster/vagrant/test/hello/1 10
>> >> >
>> >> > For the above command, the scheduler would take the task config of
>> >> > instance 1 of the 'hello' job and replicate it 10 more times, thus
>> >> > adding 10 additional instances to the job.
>> >> >
>> >> > There are, of course, some details to work out, like making sure no
>> >> > active update is in flight, that scale-out does not violate quota,
>> >> > etc. I intend to address those during the implementation as things
>> >> > progress.
>> >> >
>> >> > Does the above make sense? Any concerns/suggestions?
>> >> >
>> >> > Thanks,
>> >> > Maxim
>> >> >
>> >> > [1] - https://issues.apache.org/jira/browse/AURORA-1258
>> >> >
>> >>
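
To make the "autoscaling client simplification" point above concrete, here is
a rough Python sketch of what an external autoscaler could look like against
the proposed RPCs. This is only an illustration: the SchedulerStub class, its
method names, and the JobKey/InstanceKey shapes below are placeholders modeled
on the thrift snippets quoted above, not the real generated Aurora client.

    from dataclasses import dataclass


    @dataclass(frozen=True)
    class JobKey:
        role: str
        environment: str
        name: str


    @dataclass(frozen=True)
    class InstanceKey:
        job: JobKey
        instance_id: int


    class SchedulerStub:
        """Stand-in for the thrift-generated scheduler client (hypothetical)."""

        def scaleOut(self, template_key: InstanceKey, increment_count: int) -> None:
            print(f"scaleOut({template_key}, {increment_count})")

        def scaleIn(self, job: JobKey, decrement_count: int) -> None:
            print(f"scaleIn({job}, {decrement_count})")


    def reconcile(client: SchedulerStub, job: JobKey, running: int, desired: int) -> None:
        """Converge the instance count toward `desired` using only live scheduler state."""
        if desired > running:
            # Replicate the config of an instance that is already running
            # (instance 0 here); no .aurora file is needed on the autoscaler side.
            client.scaleOut(InstanceKey(job, 0), desired - running)
        elif desired < running:
            client.scaleIn(job, running - desired)


    if __name__ == "__main__":
        job = JobKey(role="vagrant", environment="test", name="hello")
        reconcile(SchedulerStub(), job, running=5, desired=8)  # scales out by 3
        reconcile(SchedulerStub(), job, running=8, desired=6)  # scales in by 2

Doing the same reconciliation through startJobUpdate would force the
autoscaler to carry a full TaskConfig (effectively the original aurora config)
just to change the instance count, which is the friction [1] describes.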