Re: Review Request 58768: AURORA-1924 Aurora client should reconcile idempotent job creations

Mehrdad Nurolahzade Wed, 03 May 2017 08:56:44 -0700


> On April 27, 2017, 12:06 p.m., Mehrdad Nurolahzade wrote:
> > This does not work as intended in the presence of multiple clients.
> > 
> > Example timeline:
> > 
> > Client 1|Client 2|Request|Scheduler|Response
> > --------|--------|-------|---------|--------
> > Create J|        |OK     |Created  |FAIL      // Delivering response failed, client 1 will retry after 5 seconds
> >         |Kill J  |OK     |Killed   |OK        // Client 2 successfully killed J
> > Create J|        |OK     |Created  |OK        // Client 1 will conclude that it has successfully created J while the global state has been compromised.

Reflecting on the behavior change introduced by this patch, I am no longer 
concerned. Here is the justification.

In the multi-client world of Aurora, where clients can concurrently access the 
scheduler and submit requests over unreliable communication channels, one of 
the following four situations can occur when a job is created:

1. **One request**: the job create request is received and processed, and the 
response is delivered to the client. The request succeeds if the job key does 
not exist and fails otherwise (`ResponseCode.INVALID_REQUEST` with no 
`JobCreateResult`).
2. **One request, retried**: either the job create request is not received or 
its response is not delivered, so the client retries the request after 5 
seconds. If the request was received the first time, the retry is softly 
rejected (`ResponseCode.INVALID_REQUEST` with a `JobCreateResult`). If it was 
not received the first time, it is processed on the retry and the job is either 
created or the request fails (case 1 above).
3. **Two requests, read-only operation in between**: the job create request is 
not received or its response is not delivered, so client one retries the 
request after 5 seconds. In between the two requests from client one, the 
scheduler handles a read-only operation from client two on the same job. The 
concern here is that client two might make a decision based on a job state that 
is no longer valid after the retry from client one. But this also happens 
today. Aurora does not provide an atomic CAS (compare-and-swap) operation, so 
there is no guarantee that the scheduler state does not change between a read 
and the follow-up mutable operation.
4. **Two requests, mutable operation in between**: the job create request is 
not received or its response is not delivered, so client one retries the 
request after 5 seconds. In between the two requests from client one, the 
scheduler handles a mutable operation from client two on the same job (the 
scenario depicted in my previous comment). The concern here is that client two 
might make a decision based on the modification it just made to the job, and 
that modification might no longer hold after the retry from client one. Again, 
this is the same behavior that exists today. Aurora does not support 
multi-operation transactions, so the scheduler state can change between a 
mutable operation and any follow-up read-only or mutable operation.
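
To make cases 1, 2, and 4 concrete, here is a minimal in-memory sketch. It is 
not the code in this patch: `ToyScheduler`, `Response`, 
`client_create_succeeded`, and the `job_create_result` field are hypothetical 
stand-ins, and comparing configurations with `==` is an assumption about how 
"exact same configuration" is detected. Only `ResponseCode.INVALID_REQUEST` 
and the presence or absence of a `JobCreateResult` come from the discussion 
above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class ResponseCode(Enum):
    OK = "OK"
    INVALID_REQUEST = "INVALID_REQUEST"


@dataclass
class Response:
    code: ResponseCode
    # Hypothetical stand-in for a JobCreateResult; None means none attached.
    job_create_result: Optional[str] = None


@dataclass
class ToyScheduler:
    # Maps job key -> configuration. Purely illustrative state.
    jobs: Dict[str, dict] = field(default_factory=dict)

    def create_job(self, key: str, config: dict) -> Response:
        existing = self.jobs.get(key)
        if existing is None:
            self.jobs[key] = dict(config)
            return Response(ResponseCode.OK)
        if existing == config:
            # Soft rejection: same key and identical configuration (assumed check).
            return Response(ResponseCode.INVALID_REQUEST, job_create_result=key)
        # Hard rejection: the key exists with a different configuration.
        return Response(ResponseCode.INVALID_REQUEST)

    def kill_job(self, key: str) -> Response:
        self.jobs.pop(key, None)
        return Response(ResponseCode.OK)


def client_create_succeeded(response: Response) -> bool:
    """Client-side view: OK or a soft rejection both count as success."""
    if response.code is ResponseCode.OK:
        return True
    return (response.code is ResponseCode.INVALID_REQUEST
            and response.job_create_result is not None)


config = {"task": "hello", "instances": 2}

# Case 2: the first create is processed but its response is lost; the retry
# is softly rejected and the client treats it as success.
s1 = ToyScheduler()
s1.create_job("J", config)  # response never reaches the client
case2_ok = client_create_succeeded(s1.create_job("J", config))

# Case 1 (failure branch): the key exists with a different configuration.
case1_ok = client_create_succeeded(
    s1.create_job("J", {"task": "hello", "instances": 3}))

# Case 4: client two kills J between client one's create and its retry.
s2 = ToyScheduler()
s2.create_job("J", config)  # client one, response lost
s2.kill_job("J")            # client two
case4_ok = client_create_succeeded(s2.create_job("J", config))
```

In this sketch `case2_ok` and `case4_ok` are `True` and `case1_ok` is `False`. 
In case 4 the job exists again after the retry even though client two killed 
it, which is the same outcome a plain retry produces without the patch.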

I believe we can review and accept this patch.

- Mehrdad

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58768/#review173237
-----------------------------------------------------------

On May 2, 2017, 9:33 p.m., Mehrdad Nurolahzade wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58768/
> -----------------------------------------------------------
> 
> (Updated May 2, 2017, 9:33 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1924
>     https://issues.apache.org/jira/browse/AURORA-1924
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Aurora scheduler rejects a request to create a job if a job with the same key 
> already exists. Aurora client exits with an error once it receives a response 
> with `ResponseCode.INVALID_REQUEST` from scheduler in this case.
> 
> However, an attempt to create a job with the exact same configuration and 
> number of instances is essentially idempotent. Scheduler can detect this 
> situation, ignore it, and signal client to treat operation as successful; 
> client warns user about existing job but does not fail the operation.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 3749531b5412d7ca217736aa85eed8e6606225ad 
>   src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java 059fbb86a575f5b3d78a63c9a7b5a9eebb6cb3ae 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/test/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterfaceTest.java 016859ca3bf83f64d2576b4c7109729770f9e25c 
>   src/test/python/apache/aurora/client/cli/test_create.py 3b09bb25e919bac2795ccd56bd98657b1f98690b 
> 
> 
> Diff: https://reviews.apache.org/r/58768/diff/1/
> 
> 
> Testing
> -------
> 
> - Manually under Vagrant
> - End to end test script
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>
