On 7 April 2015 at 21:28, Erb, Stephan <[email protected]> wrote:

> Brian, do you have any particular plans regarding your shutdown
> requirements? I have seen that you have filed another issue [1] which is
> also concerned with graceful shutdown.
>

Given this thread, I now only wish to hit a different endpoint than
/quitquitquit (and I may as well do /abortabortabort while I'm at it). The
rest is changes to our internal shutdown handling.

> Stephan
>
> PS: For what it's worth, I implemented the 'quick fix' version to my
> problem stated in the beginning of this thread [2].
>

That's handy. When writing the code up today I noticed that hitting
/quitquitquit wasn't unittested. I hope to have that up for review tomorrow
with unittests, which you could build on to do a more end-to-end unittest
for your code.

Brian

> [1] https://issues.apache.org/jira/browse/AURORA-1257
> [2] https://reviews.apache.org/r/32889/
>
> ________________________________________
> From: Brian Brazil <[email protected]>
> Sent: Tuesday, March 24, 2015 10:48 PM
> To: [email protected]
> Subject: Re: Graceful task shutdown
>
> On 24 March 2015 at 21:33, George Sirois <[email protected]> wrote:
>
> > Unfortunately I don't think my change will be able to make it in as-is.
> >
> > As Brian Wickman pointed out, it could introduce serious problems because
> > there are varying timeouts across the scheduler/executor, so if you set
> > your wait time to be too high, the scheduler might start to consider the
> > tasks lost because they stayed in the transient KILLING state for too long.
> >
>
> Hmm, what sort of work is involved in resolving that?
>
> In my case I need at least 12s after the /qqq before sending the TERM.
>
> Brian
>
> >
> > I do think the lifecycle modules idea would solve Stephan's issue.
> >
> > On Tue, Mar 24, 2015 at 5:06 PM, Brian Brazil <[email protected]> wrote:
> >
> > > On 24 March 2015 at 20:57, Erb, Stephan <[email protected]> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > we are implementing the /health endpoint in our services but omit the
> > > > implementation of the unauthenticated lifecycle methods /quitquitquit
> > > > and /abortabortabort.
> > > >
> > > > As a consequence, stopping a service is taxed by 10 seconds waiting
> > > > time [1]. I would like to get rid of this unnecessary delay and can
> > > > think of two solutions:
> > > >
> > > > a) Only perform the escalation wait when the http_signaler reports
> > > > that the message could be delivered to the service. This is a rather
> > > > simple and localized fix.
> > > >
> > > > b) Use another port for lifecycle events. This would require a new
> > > > addition to the task configuration and proper plumbing throughout the
> > > > rest of the system. Backward compatibility could be achieved by using
> > > > 'health' as the default lifecycle management port.
> > > >
> > > > Any thoughts? I would be happy with the simple solution, but in the
> > > > end it's your call :-)
> > > >
> > >
> > > __george mentioned on IRC working on a change that'll let the wait time
> > > be configurable (which is something I also need), would that cover your
> > > use case?
> > >
> > > There were also discussions on IRC about custom lifecycle modules.
> > >
> > > Brian
> > >
> > > >
> > > > Best Regards,
> > > > Stephan
> > > >
> > > > [1] https://github.com/apache/incubator-aurora/blob/master/src/main/python/apache/aurora/executor/thermos_task_runner.py#L123
> > > >
> > >
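
For reference, option (a) from the original message, combined with the
configurable wait time discussed above, could look roughly like the sketch
below. This is not the code in thermos_task_runner.py or in either review.
The function name, the `post` and `send_term` callables, and the default of
5 seconds per step are all assumptions made for illustration; only the two
endpoint names and the final TERM come from the thread.

```python
import time
from typing import Callable


def shutdown_with_escalation(
    post: Callable[[str], bool],
    send_term: Callable[[], None],
    wait_secs: float = 5.0,  # assumed per-step wait, configurable
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Escalate /quitquitquit -> /abortabortabort -> TERM.

    `post` is a hypothetical stand-in for the HTTP signaler: it takes an
    endpoint path and returns True only if the request was delivered.
    The wait after each endpoint is skipped when delivery failed, so a
    service that does not implement the endpoints is not delayed.
    Returns the total number of seconds spent waiting.
    """
    waited = 0.0
    for endpoint in ("/quitquitquit", "/abortabortabort"):
        if post(endpoint):
            # The service received the request: give it time to react.
            sleep(wait_secs)
            waited += wait_secs
    # Last resort, sent regardless of whether the endpoints responded.
    send_term()
    return waited
```

With a service that implements neither endpoint, `post` returns False both
times and the TERM goes out immediately. The caveat raised earlier in the
thread still applies: a large `wait_secs` keeps the task in KILLING longer,
so it has to stay below whatever timeouts the scheduler enforces.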
