Re: Review Request 60714: aurora job restart request should be idempotent and retryable

Mehrdad Nurolahzade Fri, 14 Jul 2017 21:05:32 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60714/#review180632
-----------------------------------------------------------




Looks legit. Just as Santhosh said let's make sure semantics of functionality 
relying on common logic is no affected.

With job restart command, going from "at most once" back to "at least once" can 
come at the price of restarting instances more than once. But, this is what we 
used to do prior to client retry semantic change. So, in general should be safe.

- Mehrdad Nurolahzade


On July 7, 2017, 4:52 p.m., Kai Huang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60714/
> -----------------------------------------------------------
> 
> (Updated July 7, 2017, 4:52 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Mehrdad Nurolahzade, Santhosh 
> Kumar Shanmugham, and Zameer Manji.
> 
> 
> Bugs: AURORA-1940
>     https://issues.apache.org/jira/browse/AURORA-1940
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> aurora job restart request should be idempotent and retryable.
> 
> There was a recent change to the Aurora client to provide "at most once" 
> instead of "at least once" retries for non-idempotent operations. See: 
> https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6
> 
> Technically, `aurora job restart` is a non-idempotent operation, thus it was 
> not retried. However, when a transport exception occurs, the operator has to 
> babysit simple operations like aurora job restart if it were not retried. 
> Compared to the requests that were causing problems (admin tasks, job 
> creating, updates, etc.), restarts in general should be retried rather than 
> erring on the side of caution.
> 
> Job restart can be divided into three steps:
> - 1. get instance status (getTasksWithoutConfigs)
> - 2. restart shards (restartShards)
> - 3. watch instance until healthy (getTasksWithoutConfigs)
> 
> TTransport exception can be thrown at each of these step, ideally we should 
> make __ALL__ of the steps above __idempotent__ and retryable. 
> 
> The only trickey part is that the `watch` logic is also used in --wait-until 
> options of job create/add command, making this step retryable will have an 
> impact on job create/add commands as well.
> 
> In this CR, I will make the first __TWO__ steps retryable since they are 
> self-contained in job restart command.
> If people are OK with this strategy, I'll make the `watch` step retryable as 
> well.
> 
> __Updates__:
> I made the `watch` step in `aurora job restart` retryable in the 3rd 
> revision. Note the `InstanceWatcher` in `Restarter` relies on 
> `StatusHelper.get_tasks` method to query the latest instance status. This 
> method is also invoked by JobMonitor during job create/add. 
> 
> The solution here is to pass a `retry` flag to this method, so that it can be 
> customized for different scenarios.
> 
> __Open Question__:
> Currently, `aurora job create --wait-until RUNNING` fails immediately if 
> there is any transport exception during the `wait` step. Do we need retry for 
> the `wait` step?
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/client/api/instance_watcher.py 
> a35fb22edc9a5bb12f286856b170f5cb0170eceb 
>   src/main/python/apache/aurora/client/api/restarter.py 
> 6600c6b608ee70a02e11ca8550a06b7fb76fd863 
>   src/main/python/apache/aurora/client/api/task_util.py 
> fb7c76f42171737701c740fddf893f07211a47a0 
>   src/test/python/apache/aurora/api_util.py 
> bd6e3a6abb5a2eec621121fc1b4d547c54a9d19d 
>   src/test/python/apache/aurora/client/api/test_instance_watcher.py 
> 8fd419f812c6e038e6f7fba0a662da0a6410b794 
>   src/test/python/apache/aurora/client/api/test_job_monitor.py 
> 537abd38cd13bcbb615c0224c983868b29027379 
>   src/test/python/apache/aurora/client/api/test_restarter.py 
> a81003e79e090bbfa151431366f5394f405d99eb 
>   src/test/python/apache/aurora/client/api/test_task_util.py 
> 365ef59857003402c4354e06abeaea947d49322b 
>   src/test/python/apache/aurora/client/cli/test_command_hooks.py 
> a44a25fb1e4357780adca4403b6010d160066b21 
>   src/test/python/apache/aurora/client/cli/test_create.py 
> 3b09bb25e919bac2795ccd56bd98657b1f98690b 
>   src/test/python/apache/aurora/client/cli/test_plugins.py 
> 762735e00bd8051ab64674e7ee359758d3cbeca7 
>   src/test/python/apache/aurora/client/cli/test_restart.py 
> cb4adc55caec354d2cdf6a92ff2dd8a3b4a9eca8 
> 
> 
> Diff: https://reviews.apache.org/r/60714/diff/3/
> 
> 
> Testing
> -------
> 
> ./build-support/jenkins/build.sh
>  
>  To test the retry logic, TTranport exceptions were randomly thrown at the 
> client side(for all api calls to the scheduler proxy, there is a 50% chance 
> of throwing TTranport exception), aurora job restart command was issued 
> against scheduler, and the output was like:
> ```
>  vagrant@aurora:~$ aurora job restart devcluster/vagrant/test/hello
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  INFO] Performing rolling restart of job devcluster/vagrant/test/hello 
> (instances: [0])
>  INFO] Restarting instances: [0]
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  INFO] Watching instances: [0]
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  INFO] Detected RUNNING instance 0
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  INFO] Instance 0 has been up and healthy for at least 30 seconds
>  INFO] All instances were restarted successfully
>  Job devcluster/vagrant/test/hello restarted successfully
> ```
> 
> 
> Thanks,
> 
> Kai Huang
> 
>

Re: Review Request 60714: aurora job restart request should be idempotent and retryable

Reply via email to