Re: Review Request 58768: AURORA-1924 Aurora client should reconcile idempotent job creations

David McLaughlin Wed, 03 May 2017 09:10:57 -0700


> On April 27, 2017, 7:06 p.m., Mehrdad Nurolahzade wrote:
> > This does not work as intended in the presence of multiple clients.
> > 
> > Example timeline:
> > 
> > Client 1|Client 2|Request|Scheduler|Response|Notes
> > --------|--------|-------|---------|--------|-----
> > Create J|        |OK     |Created  |FAIL    |Delivering the response failed; client 1 will retry after 5 seconds
> >         |Kill J  |OK     |Killed   |OK      |Client 2 successfully killed J
> > Create J|        |OK     |Created  |OK      |Client 1 will conclude that it has successfully created J while the global state has been compromised
> 
> Mehrdad Nurolahzade wrote:
>     Reflecting on the behavior change introduced by this patch, I am no 
> longer concerned. Here is the justification.
>     
>     In the multi-client world of Aurora, where clients can concurrently 
> access the scheduler and submit requests over unreliable communication 
> channels, one of the following four situations can occur when a job is 
> created:
>     
>     1. **One request**: the job create request is received and processed, 
> and the response is delivered to the client. The request succeeds if the key 
> does not exist, and fails otherwise (`ResponseCode.INVALID_REQUEST` with no 
> `JobCreateResult`).
>     2. **One request, retried**: the job create request or its response is 
> lost, and the client retries the request after 5 seconds. If the request was 
> received the first time, the retry is softly rejected 
> (`ResponseCode.INVALID_REQUEST` with a `JobCreateResult`). If it was not 
> received the first time, it is processed now and the job is either created 
> or the request fails (case 1 above).
>     3. **Two requests, read-only operation in between**: the job create 
> request or its response is lost, client one retries the request after 
> 5 seconds, and the scheduler handles a read-only operation from client two on 
> the same job in between the two requests from client one. The concern 
> here is that client two might make a decision based on a job state that is 
> no longer valid after the retry from client one. But this 
> also happens today. Aurora does not provide atomic CAS operation support, 
> so there is no guarantee that scheduler state does not change 
> between a read and the follow-up mutating operation.
>     4. **Two requests, mutating operation in between**: the job create 
> request or its response is lost, client one retries the request after 
> 5 seconds, and the scheduler handles a mutating operation from client two on 
> the same job in between the two requests from client one (the scenario 
> depicted in my previous comment). The concern here is that client two might 
> make a decision based on the modification it just made to the state of the 
> job, which might no longer be valid after the retry from client one. Again, 
> this is the same behavior that exists today. Aurora does not support 
> multi-operation transactions, so scheduler state can change 
> between a mutating operation and follow-up read-only or mutating operations.
>     
>     I believe we can review and accept this patch.
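
For anyone following the thread, the cases quoted above can be replayed against a toy in-memory model. This is only a sketch under stated assumptions: `ToyScheduler`, `Response`, `has_job_create_result`, and `replay_timeline` are hypothetical names, not Aurora code, and the boolean merely stands in for the presence of a `JobCreateResult` on a rejected create.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class ResponseCode(Enum):
    OK = "OK"
    INVALID_REQUEST = "INVALID_REQUEST"


@dataclass(frozen=True)
class Response:
    code: ResponseCode
    # Hypothetical stand-in for a `JobCreateResult` attached to a rejection;
    # True marks the "soft" rejection described in case 2.
    has_job_create_result: bool = False


class ToyScheduler:
    """In-memory model of job creation keyed by job key (assumption)."""

    def __init__(self) -> None:
        self.jobs: Dict[str, dict] = {}

    def create_job(self, key: str, config: dict) -> Response:
        existing = self.jobs.get(key)
        if existing is None:
            self.jobs[key] = dict(config)
            return Response(ResponseCode.OK)
        if existing == config:
            # Same key, same configuration: soft rejection.
            return Response(ResponseCode.INVALID_REQUEST, has_job_create_result=True)
        # Same key, different configuration: hard rejection.
        return Response(ResponseCode.INVALID_REQUEST)

    def kill_job(self, key: str) -> Response:
        self.jobs.pop(key, None)
        return Response(ResponseCode.OK)


def replay_timeline():
    """Replay the three-row timeline from the first comment."""
    scheduler = ToyScheduler()
    config = {"instances": 2}
    first = scheduler.create_job("J", config)   # client 1; response is lost
    killed = scheduler.kill_job("J")            # client 2
    retry = scheduler.create_job("J", config)   # client 1 retries
    return first, killed, retry, "J" in scheduler.jobs
```

In this model the retry comes back `OK` and J exists again after client 2 killed it, which is case 4. Without the kill in between, the retry would get the soft rejection from case 2.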


Now that we have shipped the change to stop automatically retrying job 
creates, is this still necessary?


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58768/#review173237
-----------------------------------------------------------


On May 3, 2017, 4:33 a.m., Mehrdad Nurolahzade wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58768/
> -----------------------------------------------------------
> 
> (Updated May 3, 2017, 4:33 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1924
>     https://issues.apache.org/jira/browse/AURORA-1924
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The Aurora scheduler rejects a request to create a job if a job with the same 
> key already exists. In this case, the Aurora client exits with an error once 
> it receives a response with `ResponseCode.INVALID_REQUEST` from the scheduler.
> 
> However, an attempt to create a job with the exact same configuration and 
> number of instances is essentially idempotent. The scheduler can detect this 
> situation, ignore the request, and signal the client to treat the operation 
> as successful; the client then warns the user about the existing job but 
> does not fail the operation.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 3749531b5412d7ca217736aa85eed8e6606225ad 
>   src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java 059fbb86a575f5b3d78a63c9a7b5a9eebb6cb3ae 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/test/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterfaceTest.java 016859ca3bf83f64d2576b4c7109729770f9e25c 
>   src/test/python/apache/aurora/client/cli/test_create.py 3b09bb25e919bac2795ccd56bd98657b1f98690b 
> 
> 
> Diff: https://reviews.apache.org/r/58768/diff/1/
> 
> 
> Testing
> -------
> 
> - Manually under Vagrant
> - End to end test script
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>
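
The client-side behavior in the Description above (warn about an existing, identical job instead of failing) could look roughly like the sketch below. It is an illustration only, not the code in `jobs.py`: `handle_create_response`, the exit-code constants, the logger name, and the plain-string response codes are all assumptions made for this example.

```python
import logging

log = logging.getLogger("aurora.sketch")  # hypothetical logger name

EXIT_OK = 0             # hypothetical exit codes for this sketch
EXIT_INVALID_REQUEST = 1


def handle_create_response(response_code: str,
                           has_job_create_result: bool,
                           job_key: str) -> int:
    """Map a create-job response to a client exit code.

    `has_job_create_result` stands in for the scheduler's signal that the
    rejected create matched the existing job exactly (an assumption here).
    """
    if response_code == "OK":
        return EXIT_OK
    if response_code == "INVALID_REQUEST" and has_job_create_result:
        # Idempotent create: warn the user but do not fail the operation.
        log.warning("Job %s already exists with the same configuration; "
                    "treating create as successful", job_key)
        return EXIT_OK
    log.error("Job creation for %s failed with %s", job_key, response_code)
    return EXIT_INVALID_REQUEST
```

A rejected create without the signal still fails, so a job that exists under the same key with a different configuration is reported as an error, as before.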
