> On April 27, 2017, 7:06 p.m., Mehrdad Nurolahzade wrote:
> > This does not work as intended in the presence of multiple clients.
> >
> > Example timeline:
> >
> > Client 1|Client 2|Request|Scheduler|Response
> > --------|--------|-------|---------|--------
> > Create J|        |OK     |Created  |FAIL // Delivering response failed, client 1 will retry after 5 seconds
> >         |Kill J  |OK     |Killed   |OK // Client 2 successfully killed J
> > Create J|        |OK     |Created  |OK // Client 1 will conclude that it has successfully created J while the global state has been compromised.
>
> Mehrdad Nurolahzade wrote:
>     Reflecting on the behavior change introduced by this patch, I am no longer concerned. Here is the justification.
>
>     In the multi-client world of Aurora, where clients can concurrently access the scheduler and submit requests over unreliable communication channels, one of the following four situations can happen when it comes to job creation:
>
>     1. **One request**: the job create request is received and processed, and the response is delivered to the client. The request succeeds if the key does not exist, and fails otherwise (`ResponseCode.INVALID_REQUEST` with no `JobCreateResult`).
>     2. **One request, retried**: the job create request is not received or its response is not delivered, and the client retries the request after 5 seconds. If it was received the first time, it is softly rejected this time (`ResponseCode.INVALID_REQUEST` with a `JobCreateResult`). If it was not received the first time, it is processed this time and the job is either created or the request fails (case 1 above).
>     3. **Two requests, read-only operation in between**: the job create request is not received or its response is not delivered, client one retries the request after 5 seconds, and the scheduler handles a read-only operation from client two associated with the same job in between the two requests from client one. The concern here is that client two might make a decision based on a state of the job that might no longer be valid after the retry from client one. But this also happens today. Aurora does not provide atomic CAS operation support, therefore there is no guarantee that scheduler state does not change in between a read and the follow-up mutable operation.
>     4. **Two requests, mutable operation in between**: the job create request is not received or its response is not delivered, client one retries the request after 5 seconds, and the scheduler handles a mutable operation from client two associated with the same job in between the two requests from client one (the scenario depicted in my previous comment). The concern here is that client two might make a decision based on the modification it just made to the state of the job, which might no longer be valid after the retry from client one. Again, this is the same behavior that exists today. Aurora does not support multi-operation transactions, therefore scheduler state can change in between a mutable operation and follow-up read-only or mutable operations.
>
>     I believe we can review and accept this patch.

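For readers following the thread, cases 2 and 4 above can be replayed against a toy in-memory scheduler. This is only a sketch under stated assumptions: `ToyScheduler`, `Response`, and the `job_create_result` field are hypothetical stand-ins invented for illustration, not Aurora's Thrift API or the code in this patch, and "same configuration" is modeled as plain dictionary equality.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional, Tuple


class ResponseCode(Enum):
    OK = "OK"
    INVALID_REQUEST = "INVALID_REQUEST"


@dataclass
class Response:
    code: ResponseCode
    # Hypothetical stand-in for JobCreateResult: set only on a "soft" rejection.
    job_create_result: Optional[str] = None


@dataclass
class ToyScheduler:
    """In-memory stand-in for the scheduler (an assumption, not Aurora code)."""
    jobs: Dict[str, dict] = field(default_factory=dict)

    def create_job(self, key: str, config: dict) -> Response:
        existing = self.jobs.get(key)
        if existing is None:
            self.jobs[key] = dict(config)
            return Response(ResponseCode.OK)
        if existing == config:
            # Same key and same configuration: reject softly.
            return Response(ResponseCode.INVALID_REQUEST, job_create_result=key)
        # Same key, different configuration: plain rejection.
        return Response(ResponseCode.INVALID_REQUEST)

    def kill_job(self, key: str) -> Response:
        # The toy always reports OK, whether or not the job existed.
        self.jobs.pop(key, None)
        return Response(ResponseCode.OK)


def replay_case_2() -> Tuple[Response, Response]:
    """One request, retried: the first response is assumed lost in transit."""
    scheduler = ToyScheduler()
    config = {"instances": 2}
    first = scheduler.create_job("J", config)  # processed, response lost
    retry = scheduler.create_job("J", config)  # client retries after timeout
    return first, retry


def replay_case_4() -> Tuple[Response, Response, bool]:
    """Two requests with a kill from a second client in between."""
    scheduler = ToyScheduler()
    config = {"instances": 2}
    scheduler.create_job("J", config)          # client 1, response lost
    kill = scheduler.kill_job("J")             # client 2 kills J
    retry = scheduler.create_job("J", config)  # client 1 retries
    return kill, retry, "J" in scheduler.jobs
```

In the toy, the case 2 retry comes back as a soft rejection, while the case 4 retry silently re-creates J after client two killed it. That is the interleaving the thread argues is already possible without the patch.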
Now that we shipped the change to not automatically retry job create - is this still necessary?


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58768/#review173237
-----------------------------------------------------------


On May 3, 2017, 4:33 a.m., Mehrdad Nurolahzade wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58768/
> -----------------------------------------------------------
>
> (Updated May 3, 2017, 4:33 a.m.)
>
>
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and Stephan Erb.
>
>
> Bugs: AURORA-1924
>     https://issues.apache.org/jira/browse/AURORA-1924
>
>
> Repository: aurora
>
>
> Description
> -------
>
> Aurora scheduler rejects a request to create a job if a job with the same key already exists. Aurora client exits with an error once it receives a response with `ResponseCode.INVALID_REQUEST` from scheduler in this case.
>
> However, an attempt to create a job with the exact same configuration and number of instances is essentially idempotent. Scheduler can detect this situation, ignore it, and signal client to treat operation as successful; client warns user about existing job but does not fail the operation.
>
>
> Diffs
> -----
>
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 3749531b5412d7ca217736aa85eed8e6606225ad
>   src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java 059fbb86a575f5b3d78a63c9a7b5a9eebb6cb3ae
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc
>   src/test/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterfaceTest.java 016859ca3bf83f64d2576b4c7109729770f9e25c
>   src/test/python/apache/aurora/client/cli/test_create.py 3b09bb25e919bac2795ccd56bd98657b1f98690b
>
>
> Diff: https://reviews.apache.org/r/58768/diff/1/
>
>
> Testing
> -------
>
> - Manually under Vagrant
> - End to end test script
>
>
> Thanks,
>
> Mehrdad Nurolahzade
>
>
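
The client-side half of the Description in the review request (warn about an existing identical job, but do not fail) could look roughly like the sketch below. It is not the change made to `jobs.py` in this patch: `CreateResponse`, `exit_code_for_create`, the exit code values, and the string response codes are all assumptions made up for illustration.

```python
import logging
from dataclasses import dataclass

# Hypothetical exit codes, not the Aurora client's real constants.
EXIT_OK = 0
EXIT_INVALID_PARAMETER = 1

log = logging.getLogger("create_sketch")


@dataclass
class CreateResponse:
    """Hypothetical, simplified view of a job create response."""
    code: str                            # "OK" or "INVALID_REQUEST"
    has_job_create_result: bool = False  # True only for a soft rejection
    message: str = ""


def exit_code_for_create(resp: CreateResponse) -> int:
    """Map a create response to a process exit code."""
    if resp.code == "OK":
        return EXIT_OK
    if resp.code == "INVALID_REQUEST" and resp.has_job_create_result:
        # An identical job already exists: warn the user, but treat as success.
        log.warning("Job already exists with the same configuration: %s",
                    resp.message)
        return EXIT_OK
    # Any other rejection is still a failure.
    log.error("Job creation failed: %s", resp.message)
    return EXIT_INVALID_PARAMETER
```

With this shape, a retried create that hits an identical job exits with status 0 after a warning, while a create against a job with a different configuration still exits non-zero.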
