Re: Review Request 58768: AURORA-1924 Aurora client should reconcile idempotent job creations

Mehrdad Nurolahzade Wed, 03 May 2017 10:36:29 -0700


> On April 27, 2017, 12:06 p.m., Mehrdad Nurolahzade wrote:
> > This does not work as intended in presence of multiple clients.
> > 
> > Example timeline:
> > 
> > Client 1|Client 2|Request|Scheduler|Response
> > --------|--------|-------|---------|--------
> > Create J|        |OK     |Created  |FAIL      // Delivering response 
> > failed, client 1 will retry after 5 seconds
> >         |Kill J  |OK     |Killed   |OK        // Client 2 successfully 
> > killed J
> > Create J|        |OK     |Created  |OK        // Client 1 will conclude 
> > that it has successfully created J while the global state has been 
> > comprimised.
> 
> Mehrdad Nurolahzade wrote:
>     Reflecting on the behavior change introduced by this patch, I am no 
> longer concerned. Here is the justification.
>     
>     In the multi-client world of Aurora where clients can concurrently access 
> scheduler and submit requests over unreliable communication channels, one of 
> the following four situations can happen when it comes to job creation:
>     
>     1. **One request**: job create request is received, processed, and 
> response is delivered to client. Request is successful if key does not exist, 
> and is failed otherwise (`ResponseCode.INVALID_REQUEST` with no 
> `JobCreateResult`).
>     2. **One request, retried**: job create request/response is not 
> received/delivered, client retries request after 5 seconds. If it was 
> received the first time it is softly rejected this time 
> (`ResponseCode.INVALID_REQUEST` with a `JobCreateResult`). If it was not 
> received the first time, it is processed this time and job is either created 
> or request fails (case 1 above).
>     3. **Two requests, read-only operation in between**: job create 
> request/response is not received/delivered, client one retries request after 
> 5 seconds, scheduler handles a read-only operation from client two associated 
> with the same job in between the two requests from client one. The concern 
> here is that the client two might make a decision based on the state of the 
> job that might no longer be valid after the retry from client one. But, this 
> also happens today. Aurora does not provide atomic CAS operation support, 
> therefore there is no gurantee that scheduler state does not change in 
> between a read and the follow-up mutable operation.
>     4. **Two request, mutable operation in between**: job create 
> request/response is not received/delivered, client one retries request after 
> 5 seconds, scheduler handles a mutable operation from client two associated 
> with the same job in between the two requests from client one (the scenario 
> depicted in my previous comment). The concern here is that client two might 
> make a decision based on the modification it just made to the state of the 
> job that might no longer be valid after the retry from client one. Again, 
> this is the same behavior that exists today. Aurora does not support 
> multi-operation transactions, therefore, scheduler state can change in 
> between a mutable operation and follow-up read-only or mutable operations.
>     
>     I believe we can review and accept this patch.
> 
> David McLaughlin wrote:
>     Now that we shipped the change to not automatically retry job create - is 
> this still necessary?
> 
> Mehrdad Nurolahzade wrote:
>     It provides slightly better user experience, but we can live without it.
>     
>     **Without this patch**: When a job create fails, it's not automatically 
> retried, therefore client has to either query scheduler state to verify state 
> or keep resubmitting command until it either succeeds (job is created) or 
> fails (job already exists).
>     
>     **With this patch**: When a job create fails, it is autoamtically retried 
> (assuming that we turn on `retry=True` argument on `createJob()` client API), 
> and ultimately succeeds whether it creates the job or learns that it has been 
> created in one of the previous tries.


**I am discarding this patch**

We spoke offline and agreed that the burden of reconciling state should be 
pushed to the client following all transport layer exceptions; a job create 
operation should be considered a failure even when a job with the exact same 
configuration and number instances already exists.


- Mehrdad


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58768/#review173237
-----------------------------------------------------------


On May 2, 2017, 9:33 p.m., Mehrdad Nurolahzade wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58768/
> -----------------------------------------------------------
> 
> (Updated May 2, 2017, 9:33 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1924
>     https://issues.apache.org/jira/browse/AURORA-1924
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Aurora scheduler rejects a request to create a job if a job with the same key 
> already exists. Aurora client exits with an error once it receives a response 
> with `ResponseCode.INVALID_REQUEST` from scheduler in this case.
> 
> However, an attempt to create a job with the exact same configuration and 
> number of instances is essentially idempotent. Scheduler can detect this 
> situation, ignore it, and signal client to treat operation as successful; 
> client warns user about existing job but does not fail the operation.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> 3749531b5412d7ca217736aa85eed8e6606225ad 
>   
> src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java
>  059fbb86a575f5b3d78a63c9a7b5a9eebb6cb3ae 
>   src/main/python/apache/aurora/client/cli/jobs.py 
> b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   
> src/test/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterfaceTest.java
>  016859ca3bf83f64d2576b4c7109729770f9e25c 
>   src/test/python/apache/aurora/client/cli/test_create.py 
> 3b09bb25e919bac2795ccd56bd98657b1f98690b 
> 
> 
> Diff: https://reviews.apache.org/r/58768/diff/1/
> 
> 
> Testing
> -------
> 
> - Manually under Vagrant
> - End to end test script
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>

Re: Review Request 58768: AURORA-1924 Aurora client should reconcile idempotent job creations

Reply via email to