> On June 2, 2017, 10:28 a.m., Zameer Manji wrote:
> > I don't have comment access on the doc, so I will leave my questions here:
> > 
> > 1. Should operators (via executor flags) be able to cap this value? That is,
> > should `STOP_TIMEOUT` be an operator flag?
> > 2. Should the client cap this value and display an error message to the
> > user?
> > 3. AFAIK, a task can remain in `KILLING` forever. There is no timeout in
> > the scheduler, as it just retries kills. If a user puts a large value here,
> > I'm not sure tasks will actually terminate. Please add an e2e test here to
> > confirm/deny.
> 
> David McLaughlin wrote:
>     For (3), this is exactly what STOP_TIMEOUT in the executor is for.
>     
>     The issue with STOP_TIMEOUT is that it is Thermos-specific, and we support
>     multiple executors.
> 
> David McLaughlin wrote:
>     Sorry, to be clear: for (1), the issue is that we support multiple
>     executors, so we'd need some generic way of passing parameters from the
>     scheduler to the executor. And to be honest, I don't think the scheduler
>     should really get to make this decision. What we need is some way of
>     overriding all the magic strings and numbers littered throughout Thermos,
>     but I think that is a separate ticket. All we're really doing here is
>     bumping a constant timeout from 2 minutes to 5 minutes.
> 
> Jordan Ly wrote:
>     On (3), David is correct: STOP_TIMEOUT is essentially a hard limit on how
>     long the KILLING process can go on. An e2e test might be disruptive, since
>     you would have to wait the full 5 minutes for the timeout to be hit ---
>     unless there is another way I am missing :)
>     
>     Could you elaborate on what you mean in (2)?
For (1), we do have a generic way of passing parameters from the scheduler to
the executor. It is called `thermos_executor_flags` for legacy reasons, but it
can pass arbitrary arguments to the executor.

For (3), thanks for clearing up what `STOP_TIMEOUT` does. I have no objection
to adding a 5m e2e test; I think capturing this edge-case behaviour would be
worthwhile.

Since users can specify a value here, it would be nice to control what the cap
is. The linked design mentions a 10s-long shutdown, but I can think of valid
cases where this value should be 2m or 6m. It would be nice to add a Thermos
flag `--maximum_wait_escalation_time` that one could set via the scheduler's
`thermos_executor_flags`.

I don't feel too strongly about this flag, as I could just add it later.
However, I do feel that an e2e test of some kind is required here to ensure we
never regress.

- Zameer


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59733/#review176802
-----------------------------------------------------------


On June 1, 2017, 4:48 p.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59733/
> -----------------------------------------------------------
> 
> (Updated June 1, 2017, 4:48 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham,
> Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1931
>     https://issues.apache.org/jira/browse/AURORA-1931
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> We have some services that require more than the current 10 seconds given to
> gracefully shut down (they need to close resources, finish requests, etc.).
> 
> We would like to be able to configure the amount of time we wait between each
> stage of the graceful shutdown sequence.
> See this
> [proposal](https://docs.google.com/document/d/1Sl-KWNyt1j0nIndinqfJsH3pkUY5IYXfGWyLHU2wacs/edit?usp=sharing)
> for a more in-depth analysis.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/config/schema/base.py b2692a648645a195a24491e4978fb833c6c20be8 
>   src/main/python/apache/aurora/executor/aurora_executor.py 81461cb49ac223f3bdfa59e8c59e150a07771dea 
>   src/main/python/apache/aurora/executor/http_lifecycle.py 9280bf29da9bda1691adbf3a4c34c4f3d4900517 
>   src/test/python/apache/aurora/client/cli/test_inspect.py 4a23c5984c2d093e2f53e93aec71418f84b65928 
>   src/test/python/apache/aurora/executor/test_http_lifecycle.py a967e3410a4d2dc2e1721f505a4d76da9209d177 
>   src/test/python/apache/aurora/executor/test_thermos_task_runner.py 1b92667bceabc8ea1540122477a51cb58ea2ae36 
> 
> 
> Diff: https://reviews.apache.org/r/59733/diff/1/
> 
> 
> Testing
> -------
> 
> Ran unit and integration tests.
> 
> Created and killed jobs with varying wait_escalation_secs values on the
> Vagrant devcluster.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>
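
P.S. To make the cap discussion concrete, here is a minimal Python sketch of
the interaction being debated above. It is not the Aurora implementation: the
function names (`effective_wait`, `total_escalation_time`) and the operator cap
parameter (standing in for the proposed `--maximum_wait_escalation_time` flag)
are hypothetical; the 10-second default wait and the 5-minute STOP_TIMEOUT are
the values mentioned in this thread.

```python
DEFAULT_WAIT_SECS = 10    # current hard-coded wait between shutdown stages
STOP_TIMEOUT_SECS = 300   # executor-side hard limit on KILLING (5 minutes)


def effective_wait(requested_secs, operator_cap_secs=None):
    """Clamp a user-requested escalation wait to an operator-set cap.

    The wait applies between each stage of the graceful-shutdown
    sequence (e.g. HTTP /quitquitquit -> HTTP /abortabortabort -> kill),
    so an uncapped user value can push the total past STOP_TIMEOUT and
    make the scheduler's retried kills appear to hang.
    """
    wait = DEFAULT_WAIT_SECS if requested_secs is None else requested_secs
    if operator_cap_secs is not None:
        wait = min(wait, operator_cap_secs)
    return wait


def total_escalation_time(wait_secs, num_stages=3):
    """Worst-case time spent escalating; STOP_TIMEOUT always wins."""
    return min(num_stages * wait_secs, STOP_TIMEOUT_SECS)
```

With an operator cap of 2 minutes, a user asking for a 6-minute wait would be
clamped to 120 seconds per stage, and the overall kill sequence would still be
bounded by the 300-second STOP_TIMEOUT regardless.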