> On July 29, 2016, 7:25 p.m., Maxim Khutornenko wrote:
> > Finally had time to go over this review in detail. I am afraid the proposed 
> > approach of rolling back the arbitrary update may not work in general case 
> > and result in unexpected outcome, inconsistent job state and hard to 
> > troubleshoot update failures. Once the global update lock is lifted (e.g. 
> > the update finishes), subsequent updates and/or kill/add instance RPC calls 
> > will drift the job state making the history obsolete. 
> > 
> > For example, consider updates changing instance counts:
> > U1: 5 -> 10
> > U2: 10 -> 20 (or addInstances(0, 10))
> > 
> > If we now attempt to rollback U1 we will end up with a hole [5-9] in the 
> > instance count: 0-4, 10-19 AND likely unexpected extra 10 instances left 
> > (10-19). 
> > 
> > Following the above example we can come up with plenty of other intricate 
> > and hard to reason scenarios. By allowing randomly old update rollback we 
> > are giving our users a powerful tool that is very hard to understand or 
> > ensure correctness. 
> > 
> > The only way this _may_ be improved if we approach rollbackJobUpdate RPC as 
> > the one creating a completely new update request with the old initial state 
> > used as the new desired. This way _all_ current job instances are being 
> > considered, job diff gives correct output and quota checks are properly 
> > done. The latter btw is another problem as rollbackJobUpdate does not call 
> > `validateTaskLimits` to ensure there is enough quota to rollback something 
> > like 10 -> 5. The open problem: in case of multiple initial states to 
> > rollback to, which one to choose as the new desired one? We can only choose 
> > one according to our thrift schema (for a good reason). 
> > 
> > We have thought long and hard about the rollback cases in our internal 
> > deployment orchestration tool and ended up with the approach that operates 
> > at the .aurora config (JobUpdateRequest) level by creating a _new_ update 
> > with the target job config (JobUpdateRequest) of the _old_ one. This way we 
> > can safely rollback to the previously deployed version without violating 
> > job/cluster invariants. 
> > 
> > Given the concerns in this CR, I suggest we move this discussion to dev 
> > list or a design doc.
> > 
> > To seed our further discussion: we can consider an approach where instead 
> > of rolling back a job update itself we rollback _to_ that update's target 
> > state. For example (desired state === desiredState + settings + 
> > instanceCount):
> > U1: desired state A
> > U2: desired state B
> > U3: desired state C
> > 
> > To rollback to state B: U4: use desired state of U2
> > To rollback to state A: U4: use desired state of U1
> > 
> > While not ideal (e.g. there is no way to rollback to state A-1), this would 
> > be safer and easier to reason about. It'd also require a smaller diff as 
> > most of the required code paths already exist.
> 
> Igor Morozov wrote:
>     I think the semantics of rollback is pretty well defined: it is to 
> rollback a job(in whatever state it is right now) to the state defined as 
> initial in the update job. Should not scheduler try to kill extra instances 
> in your example as they violate the job configuration it is rolling back to? 
> In this example that simply means having just 5 instances running after 
> rollback.
> 
> Maxim Khutornenko wrote:
>     > Should not scheduler try to kill extra instances in your example as 
> they violate the job configuration it is rolling back to?
>     
>     Unfortunately, that's not possible in the current implementation. Once an 
> update object is created, the initial and desired states contain only the 
> working instance set calculated by the `JobDiff`. Any instances 
> created/killed after the update finishes are no longer considered. The 
> working set can be further reduced within a given update if 
> `updateOnlyTheseInstances` value is specified. 
>     
>     In other words, the updater goes over a narrowly defined set of instances 
> completely ignoring the state of the job after the update has started.

gotcha. Yes, then it is going to be a problem. Let's discuss it in a ticket. I 
have another suggestion.


- Igor


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50168/#review144137
-----------------------------------------------------------


On July 28, 2016, 10:54 p.m., Igor Morozov wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50168/
> -----------------------------------------------------------
> 
> (Updated July 28, 2016, 10:54 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, Bill Farner, and 
> Zameer Manji.
> 
> 
> Bugs: AURORA-1721
>     https://issues.apache.org/jira/browse/AURORA-1721
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Add rollback functionality to the scheduler
> 
> Thic change also includes an addition to storage replicated log. I needed a 
> way to re-insert lock token for an existing job update.
> 
> 
> Diffs
> -----
> 
>   CHANGELOG fc6a46d77ebf889c8f60402c95e8a7472980ba1c 
>   RELEASE-NOTES.md 29d224d5596eb060356f1343cf347db79b1ad687 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> 1d66208490aff6ea8af4c737845fa2cf13617529 
>   api/src/main/thrift/org/apache/aurora/gen/storage.thrift 
> 9e4213f13255a182df938bea44ca87fa03a25318 
>   src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java 
> 52c3c6618a3cf1009435ca8a9cece36365913e55 
>   src/main/java/org/apache/aurora/scheduler/storage/db/DbJobUpdateStore.java 
> d2673e6b328cb1e249fbe91d18e0d9e935636eaa 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java 
> 39924c62108f6a68f545f90d77ceab4265d41fad 
>   
> src/main/java/org/apache/aurora/scheduler/storage/log/WriteAheadStorage.java 
> d0de063fd78e6c4f62fae4a598d1d22f9775772d 
>   
> src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java
>  b534abf95bab6e1657e3ef993cf34c0d6ec460be 
>   
> src/main/java/org/apache/aurora/scheduler/thrift/aop/AnnotatedAuroraAdmin.java
>  9243c92b11040b68ed6014b3991db69fc08bcddf 
>   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java 
> f8357c46df1b025bf4e38a7ce1cb1c13a50c39f9 
>   
> src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
>  364c5c753f884a2d89e27802d7bbf3b2b6d3a08e 
>   src/main/python/apache/aurora/client/api/__init__.py 
> 68baf8fdb90cd26100159401c46c9963c24332b3 
>   src/main/python/apache/aurora/client/cli/update.py 
> bb526f7bf94d7bfe02fe2786493c85be1bfeb86f 
>   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
> e157c0dfde5efc418448e138aa008ade742fe816 
>   src/test/python/apache/aurora/client/api/test_scheduler_client.py 
> afbd385b7eda64cb1f7d118b695e65e4045eac6c 
> 
> Diff: https://reviews.apache.org/r/50168/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Igor Morozov
> 
>

Reply via email to