> On April 27, 2017, 12:06 p.m., Mehrdad Nurolahzade wrote:
> > This does not work as intended in presence of multiple clients.
> > 
> > Example timeline:
> > 
> > Client 1|Client 2|Request|Scheduler|Response
> > --------|--------|-------|---------|--------
> > Create J|        |OK     |Created  |FAIL      // Delivering response 
> > failed, client 1 will retry after 5 seconds
> >         |Kill J  |OK     |Killed   |OK        // Client 2 successfully 
> > killed J
> > Create J|        |OK     |Created  |OK        // Client 1 will conclude 
> > that it has successfully created J while the global state has been 
> > comprimised.
> 
> Mehrdad Nurolahzade wrote:
>     Reflecting on the behavior change introduced by this patch, I am no 
> longer concerned. Here is the justification.
>     
>     In the multi-client world of Aurora where clients can concurrently access 
> scheduler and submit requests over unreliable communication channels, one of 
> the following four situations can happen when it comes to job creation:
>     
>     1. **One request**: job create request is received, processed, and 
> response is delivered to client. Request is successful if key does not exist, 
> and is failed otherwise (`ResponseCode.INVALID_REQUEST` with no 
> `JobCreateResult`).
>     2. **One request, retried**: job create request/response is not 
> received/delivered, client retries request after 5 seconds. If it was 
> received the first time it is softly rejected this time 
> (`ResponseCode.INVALID_REQUEST` with a `JobCreateResult`). If it was not 
> received the first time, it is processed this time and job is either created 
> or request fails (case 1 above).
>     3. **Two requests, read-only operation in between**: job create 
> request/response is not received/delivered, client one retries request after 
> 5 seconds, scheduler handles a read-only operation from client two associated 
> with the same job in between the two requests from client one. The concern 
> here is that the client two might make a decision based on the state of the 
> job that might no longer be valid after the retry from client one. But, this 
> also happens today. Aurora does not provide atomic CAS operation support, 
> therefore there is no gurantee that scheduler state does not change in 
> between a read and the follow-up mutable operation.
>     4. **Two request, mutable operation in between**: job create 
> request/response is not received/delivered, client one retries request after 
> 5 seconds, scheduler handles a mutable operation from client two associated 
> with the same job in between the two requests from client one (the scenario 
> depicted in my previous comment). The concern here is that client two might 
> make a decision based on the modification it just made to the state of the 
> job that might no longer be valid after the retry from client one. Again, 
> this is the same behavior that exists today. Aurora does not support 
> multi-operation transactions, therefore, scheduler state can change in 
> between a mutable operation and follow-up read-only or mutable operations.
>     
>     I believe we can review and accept this patch.
> 
> David McLaughlin wrote:
>     Now that we shipped the change to not automatically retry job create - is 
> this still necessary?

It provides slightly better user experience, but we can live without it.

**Without this patch**: When a job create fails, it's not automatically 
retried, therefore client has to either query scheduler state to verify state 
or keep resubmitting command until it either succeeds (job is created) or fails 
(job already exists).

**With this patch**: When a job create fails, it is autoamtically retried 
(assuming that we turn on `retry=True` argument on `createJob()` client API), 
and ultimately succeeds whether it creates the job or learns that it has been 
created in one of the previous tries.


- Mehrdad


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58768/#review173237
-----------------------------------------------------------


On May 2, 2017, 9:33 p.m., Mehrdad Nurolahzade wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58768/
> -----------------------------------------------------------
> 
> (Updated May 2, 2017, 9:33 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1924
>     https://issues.apache.org/jira/browse/AURORA-1924
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Aurora scheduler rejects a request to create a job if a job with the same key 
> already exists. Aurora client exits with an error once it receives a response 
> with `ResponseCode.INVALID_REQUEST` from scheduler in this case.
> 
> However, an attempt to create a job with the exact same configuration and 
> number of instances is essentially idempotent. Scheduler can detect this 
> situation, ignore it, and signal client to treat operation as successful; 
> client warns user about existing job but does not fail the operation.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> 3749531b5412d7ca217736aa85eed8e6606225ad 
>   
> src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java
>  059fbb86a575f5b3d78a63c9a7b5a9eebb6cb3ae 
>   src/main/python/apache/aurora/client/cli/jobs.py 
> b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   
> src/test/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterfaceTest.java
>  016859ca3bf83f64d2576b4c7109729770f9e25c 
>   src/test/python/apache/aurora/client/cli/test_create.py 
> 3b09bb25e919bac2795ccd56bd98657b1f98690b 
> 
> 
> Diff: https://reviews.apache.org/r/58768/diff/1/
> 
> 
> Testing
> -------
> 
> - Manually under Vagrant
> - End to end test script
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>

Reply via email to