Definitely +1 on the idea of a safeguard.
I don't really have any proposals beyond the ones that have already been
mentioned in this thread.

W.R.T. automatically scaling up the instances via a binding helper (George
talked about it in the scaling API discussion):
Essentially, we had a binding helper which talks to the scheduler and
figures out how many task instances are active, similar to what Stephan
had brought up.
However, we've still had incidents where an engineer forgot to use the
binding helper or accidentally left it out.
To avoid operator error, I'd like whatever safeguard we choose to be
activated automatically, rather than requiring an explicit flag.
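
As a rough sketch of the "on by default" behavior I mean (all names here
are hypothetical; this is not actual Aurora client or scheduler code): the
update targets whichever is larger, the configured count or the active
count, and only shrinks the job when the operator explicitly asks for it.

```python
from typing import Optional


def resolve_instance_count(configured: int,
                           active: int,
                           explicit: Optional[int] = None) -> int:
    """Pick the instance count an update should target.

    configured: value from the .aurora config (possibly outdated)
    active:     number of active task instances reported by the scheduler
    explicit:   value the operator deliberately supplied, if any
    """
    if explicit is not None:
        # The operator opted in to a specific count, so honor it even
        # if that means removing instances.
        return explicit
    # Default path, no flag needed: never drop below what is running.
    return max(configured, active)
```

With this shape, a stale config value of 10 against 25 active instances
resolves to 25, and removing instances only happens when someone passes
an explicit count.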


On Mon, Feb 8, 2016 at 9:42 AM, Maxim Khutornenko <ma...@apache.org> wrote:

> > Or without any persistence at all.  The client could refuse to adjust the
> > instance count on a job unless there's additional command line argument.
> > The same arguments of responsibility could be said here of users of old
> > clients or custom clients.
>
> Bill, are you suggesting 'aurora update start' client command call a
> scheduler to acquire an update diff first and block startJobUpdate RPC
> call unless a special command line flag is present?
>
> > When updating a job, the scheduler would fill in the current instance
> count.
> > However, when I want to change the number of instances, I could simply
> > bind another value locally when triggering the update.
>
> Stephan, this sounds like increasing instances would also require a
> binding helper, which makes an update process less deterministic (i.e.
> .aurora config file is no longer self-contained).
>
> On Sun, Feb 7, 2016 at 3:02 PM, Erb, Stephan
> <stephan....@blue-yonder.com> wrote:
> > A related idea that recently crossed my mind was some kind of pystachio
> variable / binding helper:  {{aurora.instances}}.
> >
> > When updating a job, the scheduler would fill in the current instance
> count. However, when I want to change the number of instances, I could
> simply bind another value locally when triggering the update.
> > ________________________________________
> > From: Maxim Khutornenko <ma...@apache.org>
> > Sent: Saturday, February 6, 2016 00:07
> > To: dev@aurora.apache.org
> > Subject: Re: [PROPOSAL] Disallow instance removal in job update
> >
> > We have had attempts to safeguard client updater command with a
> > "dangerous change" warning before but it did not get good feedback.
> > Besides, automated tools/scripts just ignored it.
> >
> > An alternative could be what George suggest on the scaling API thread
> > mentioned earlier: automatically bump up instance count to the job
> > active task count. I'd say this could be an implementation to the
> > proposal above rather than a safeguard as it accomplishes the exact
> > same goal.
> >
> > Bill, do you have any ideas of what that safeguard could be?
> >
> > On Fri, Feb 5, 2016 at 2:56 PM, Bill Farner <wfar...@apache.org> wrote:
> >>>
> >>> the outdated instance count problem will only get worse as automated
> >>> scaling tools will quickly render existing .aurora config value
> obsolete
> >>
> >>
> >> This is not a compelling reason to remove functionality.  Sounds like a
> >> safeguard is needed instead.
> >>
> >> On Fri, Feb 5, 2016 at 2:43 PM, Maxim Khutornenko <ma...@apache.org>
> wrote:
> >>
> >>> This is mostly a survey rather than a proposal. How would people think
> >>> about limiting updater to only adding/updating instances and let
> >>> killTasks take care of instance removals?
> >>>
> >>> We have all heard stories (or happen to create some ourselves) when an
> >>> outdated instance count value in .aurora config caused unexpected
> >>> instance removals. Granted, there are plenty of other values in the
> >>> config that can cause service-wide outage but instance count seems to
> >>> be the worst in that sense.
> >>>
> >>> After the recent refactoring of addInstances and killTasks to act as
> >>> scaleOut/scaleIn APIs [1], the outdated instance count problem will
> >>> only get worse as automated scaling tools will quickly render existing
> >>> .aurora config value obsolete. With that in mind, should we block
> >>> instance removal in the updater and let an explicit killTasks call be
> >>> the only acceptable action to reduce instance count? Is there any
> >>> value (aside from arguable convenience factor) in having
> >>> startJobUpdate ever killing instances?
> >>>
> >>> Thanks,
> >>> Maxim
> >>>
> >>> [1] - http://markmail.org/message/2smaej5n5e54li3g
> >>>
>
