> On April 26, 2017, 1:20 a.m., David McLaughlin wrote:
> > How is the effect of changes like this measured? Seems very hunch-driven,
> > whereas other potential performance reviews were met with requests for
> > methodology, etc.
>
> David McLaughlin wrote:
>     Also, generally good to have at least two Ship Its per review? Let's make
>     sure we follow that convention.
>
> Zameer Manji wrote:
>     I don't think this needs to be measured. Just consider the following:
>     1. Thermos gives a task up to 60s to terminate.
>     2. Once the process terminates Thermos sends a `TASK_KILLED` to the
>     agent, which forwards this to Aurora.
>     3. Aurora retries task kill every 5s, which means for a process that
>     takes any time to drain it will send up to 12 `TASK_KILL` messages while
>     waiting for the `TASK_KILLED` response.
>     4. This change reduces the retries to 4, which makes far more sense to me.
>
>     Our values (60s in Thermos) and (5s in Aurora) don't align at all.
>
>     Agreed that a better explanation is needed here, and two ship its.

Understood. My feeling was that since this value could have been changed via
the command line flag pre-commit, it was pretty easy to validate the change
(or even bump it up to 30s as you suggested).


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58611/#review173008
-----------------------------------------------------------


On April 21, 2017, 10:36 a.m., Stephan Erb wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58611/
> -----------------------------------------------------------
>
> (Updated April 21, 2017, 10:36 a.m.)
>
>
> Review request for Aurora and Zameer Manji.
>
>
> Repository: aurora
>
>
> Description
> -------
>
> It is not very common that kills are dropped by Mesos and have to be retried
> by Aurora. It therefore makes sense to slightly increase the retry timeout
> so that we don't retry needlessly when Thermos is still busy executing
> the lifecycle methods.
>
> By default, Thermos uses the following kill escalation sequence:
>
> * /quitquitquit
> * wait 5s
> * /abortabortabort
> * wait 5s
> * SIGTERM
> * wait up to 1 minute
> * SIGKILL
>
>
> Diffs
> -----
>
>   src/main/java/org/apache/aurora/scheduler/reconciliation/ReconciliationModule.java e076e802f8920b37cef202520c7fbe59724dd06d
>
>
> Diff: https://reviews.apache.org/r/58611/diff/1/
>
>
> Testing
> -------
>
> ./gradlew -Pq build
>
>
> Thanks,
>
> Stephan Erb
>
>
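
The retry arithmetic discussed in this thread can be checked with a short
Python sketch. It is only an illustration under stated assumptions: the names
below are hypothetical and are not Aurora or Thermos identifiers, the wait
times are taken from the kill escalation sequence in the review description,
and the 15s interval is inferred from the "reduces the retries to 4" remark
rather than stated in the review.

```python
# Hypothetical names for illustration; not Aurora or Thermos identifiers.

# Default Thermos kill escalation waits, in seconds, as listed in the review
# description: 5s after /quitquitquit, 5s after /abortabortabort, and up to
# 60s after SIGTERM before SIGKILL is sent.
ESCALATION_WAITS_SECS = [5, 5, 60]


def kill_retries(drain_secs: int, retry_interval_secs: int) -> int:
    """Upper bound on kill retries sent while a task takes drain_secs to exit."""
    if retry_interval_secs <= 0:
        raise ValueError("retry interval must be positive")
    return drain_secs // retry_interval_secs


if __name__ == "__main__":
    sigterm_window = ESCALATION_WAITS_SECS[-1]   # the 60s used in the thread
    full_sequence = sum(ESCALATION_WAITS_SECS)   # worst case for the whole sequence
    # 5s is the old interval, 15s is inferred, 30s is the suggested alternative.
    for interval in (5, 15, 30):
        print(
            f"interval={interval}s "
            f"retries_in_sigterm_window={kill_retries(sigterm_window, interval)} "
            f"retries_in_full_sequence={kill_retries(full_sequence, interval)}"
        )
```

Under these assumptions, a 60s window gives at most 12 retries at a 5s
interval, 4 at 15s, and 2 at 30s, which matches the numbers quoted above.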
