Re: Review Request 58611: Bump initial_task_kill_retry_interval to 15s.

David McLaughlin Tue, 25 Apr 2017 18:41:43 -0700


> On April 26, 2017, 1:20 a.m., David McLaughlin wrote:
> > How is the effect of changes like this measured? Seems very hunch-driven, 
> > whereas other potential performance reviews were met with requests for 
> > methodology, etc.
> 
> David McLaughlin wrote:
>     Also, generally good to have at least two Ship Its per review? Let's make 
> sure we follow that convention.
> 
> Zameer Manji wrote:
>     I don't think this needs to be measured. Just consider the following:
>     1. Thermos gives a task up to 60s to terminate.
>     2. Once the process terminates, Thermos sends a `TASK_KILLED` to the 
> agent, which forwards this to Aurora.
>     3. Aurora retries task kill every 5s, which means for a process that 
> takes any time to drain it will send up to 12 `TASK_KILL` messages while 
> waiting for the `TASK_KILLED` response.
>     4. This change reduces the retries to 4, which makes far more sense to me.
>     
>     Our values (60s in Thermos) and (5s in Aurora) don't align at all.
>     
>     Agreed that a better explanation is needed here, and two ship its.


Understood. My feeling was that since this value could have been changed via 
the command line flag pre-commit, it was pretty easy to validate the change (or 
even bump it up to 30s as you suggested).
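
For reference, the retry arithmetic from the discussion above can be sketched 
as below. This is only a back-of-the-envelope illustration: it assumes a fixed 
retry interval (the real scheduler may back off, as the word "initial" in the 
flag name hints) and takes the 60s Thermos window quoted above at face value. 
The function name is hypothetical and not part of Aurora.

```python
def max_kill_messages(drain_window_secs: int, retry_interval_secs: int) -> int:
    """Upper bound on kill messages sent while a task drains.

    Assumes a constant retry interval (no backoff), which is a
    simplification made for this sketch.
    """
    if retry_interval_secs <= 0:
        raise ValueError("retry interval must be positive")
    return drain_window_secs // retry_interval_secs


# 60s is the Thermos termination window quoted in the thread.
for interval in (5, 15, 30):
    print(f"{interval}s interval: up to {max_kill_messages(60, interval)} kill messages")
```

Under those assumptions, 5s gives up to 12 messages, 15s gives 4, and 30s 
gives 2.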


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58611/#review173008
-----------------------------------------------------------


On April 21, 2017, 10:36 a.m., Stephan Erb wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58611/
> -----------------------------------------------------------
> 
> (Updated April 21, 2017, 10:36 a.m.)
> 
> 
> Review request for Aurora and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> It is not very common that kills are dropped by Mesos and have to be retried
> by Aurora. It therefore makes sense to slightly increase the retry timeout
> so that we don't retry needlessly when Thermos is still busy executing
> the lifecycle methods.
> 
> By default, Thermos uses the following kill escalation sequence:
> 
>   * /quitquitquit
>   * wait 5s
>   * /abortabortabort
>   * wait 5s
>   * SIGTERM
>   * wait up to 1 minute
>   * SIGKILL
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/reconciliation/ReconciliationModule.java e076e802f8920b37cef202520c7fbe59724dd06d 
> 
> 
> Diff: https://reviews.apache.org/r/58611/diff/1/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Stephan Erb
> 
>
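
The kill escalation sequence listed in the description can be laid out as a 
timeline. The sketch below is illustrative only: it assumes every wait runs to 
its full length (the worst case), and the names `ESCALATION` and 
`escalation_timeline` are hypothetical, not taken from Thermos.

```python
# (action, seconds to wait before the next step), as listed in the description.
ESCALATION = [
    ("/quitquitquit", 5),
    ("/abortabortabort", 5),
    ("SIGTERM", 60),
    ("SIGKILL", 0),
]


def escalation_timeline(steps):
    """Return (offset_secs, action) pairs, assuming each wait is used in full."""
    timeline = []
    offset = 0
    for action, wait in steps:
        timeline.append((offset, action))
        offset += wait
    return timeline


for offset, action in escalation_timeline(ESCALATION):
    print(f"t+{offset:>2}s  {action}")
```

In this worst case SIGKILL is only sent 70s after the first request, so a 
task that ignores every earlier signal can outlive several retry intervals at 
either the 5s or the 15s setting.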
