> On July 29, 2016, 7:25 p.m., Maxim Khutornenko wrote:
> > Finally had time to go over this review in detail. I am afraid the proposed 
> > approach of rolling back the arbitrary update may not work in general case 
> > and result in unexpected outcome, inconsistent job state and hard to 
> > troubleshoot update failures. Once the global update lock is lifted (e.g. 
> > the update finishes), subsequent updates and/or kill/add instance RPC calls 
> > will drift the job state making the history obsolete. 
> > 
> > For example, consider updates changing instance counts:
> > U1: 5 -> 10
> > U2: 10 -> 20 (or addInstances(0, 10))
> > 
> > If we now attempt to rollback U1 we will end up with a hole [5-9] in the 
> > instance count: 0-4, 10-19 AND likely unexpected extra 10 instances left 
> > (10-19). 
> > 
> > Following the above example we can come up with plenty of other intricate 
> > and hard to reason scenarios. By allowing randomly old update rollback we 
> > are giving our users a powerful tool that is very hard to understand or 
> > ensure correctness. 
> > 
> > The only way this _may_ be improved if we approach rollbackJobUpdate RPC as 
> > the one creating a completely new update request with the old initial state 
> > used as the new desired. This way _all_ current job instances are being 
> > considered, job diff gives correct output and quota checks are properly 
> > done. The latter btw is another problem as rollbackJobUpdate does not call 
> > `validateTaskLimits` to ensure there is enough quota to rollback something 
> > like 10 -> 5. The open problem: in case of multiple initial states to 
> > rollback to, which one to choose as the new desired one? We can only choose 
> > one according to our thrift schema (for a good reason). 
> > 
> > We have thought long and hard about the rollback cases in our internal 
> > deployment orchestration tool and ended up with the approach that operates 
> > at the .aurora config (JobUpdateRequest) level by creating a _new_ update 
> > with the target job config (JobUpdateRequest) of the _old_ one. This way we 
> > can safely rollback to the previously deployed version without violating 
> > job/cluster invariants. 
> > 
> > Given the concerns in this CR, I suggest we move this discussion to dev 
> > list or a design doc.
> > 
> > To seed our further discussion: we can consider an approach where instead 
> > of rolling back a job update itself we rollback _to_ that update's target 
> > state. For example (desired state === desiredState + settings + 
> > instanceCount):
> > U1: desired state A
> > U2: desired state B
> > U3: desired state C
> > 
> > To rollback to state B: U4: use desired state of U2
> > To rollback to state A: U4: use desired state of U1
> > 
> > While not ideal (e.g. there is no way to rollback to state A-1), this would 
> > be safer and easier to reason about. It'd also require a smaller diff as 
> > most of the required code paths already exist.
> 
> Igor Morozov wrote:
>     I think the semantics of rollback is pretty well defined: it is to 
> rollback a job(in whatever state it is right now) to the state defined as 
> initial in the update job. Should not scheduler try to kill extra instances 
> in your example as they violate the job configuration it is rolling back to? 
> In this example that simply means having just 5 instances running after 
> rollback.

> Should not scheduler try to kill extra instances in your example as they 
> violate the job configuration it is rolling back to?

Unfortunately, that's not possible in the current implementation. Once an 
update object is created, the initial and desired states contain only the 
working instance set calculated by the `JobDiff`. Any instances created/killed 
after the update finishes are no longer considered. The working set can be 
further reduced within a given update if `updateOnlyTheseInstances` value is 
specified. 

In other words, the updater goes over a narrowly defined set of instances 
completely ignoring the state of the job after the update has started.


- Maxim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50168/#review144137
-----------------------------------------------------------


On July 28, 2016, 10:54 p.m., Igor Morozov wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50168/
> -----------------------------------------------------------
> 
> (Updated July 28, 2016, 10:54 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, Bill Farner, and 
> Zameer Manji.
> 
> 
> Bugs: AURORA-1721
>     https://issues.apache.org/jira/browse/AURORA-1721
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Add rollback functionality to the scheduler
> 
> Thic change also includes an addition to storage replicated log. I needed a 
> way to re-insert lock token for an existing job update.
> 
> 
> Diffs
> -----
> 
>   CHANGELOG fc6a46d77ebf889c8f60402c95e8a7472980ba1c 
>   RELEASE-NOTES.md 29d224d5596eb060356f1343cf347db79b1ad687 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> 1d66208490aff6ea8af4c737845fa2cf13617529 
>   api/src/main/thrift/org/apache/aurora/gen/storage.thrift 
> 9e4213f13255a182df938bea44ca87fa03a25318 
>   src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java 
> 52c3c6618a3cf1009435ca8a9cece36365913e55 
>   src/main/java/org/apache/aurora/scheduler/storage/db/DbJobUpdateStore.java 
> d2673e6b328cb1e249fbe91d18e0d9e935636eaa 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java 
> 39924c62108f6a68f545f90d77ceab4265d41fad 
>   
> src/main/java/org/apache/aurora/scheduler/storage/log/WriteAheadStorage.java 
> d0de063fd78e6c4f62fae4a598d1d22f9775772d 
>   
> src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java
>  b534abf95bab6e1657e3ef993cf34c0d6ec460be 
>   
> src/main/java/org/apache/aurora/scheduler/thrift/aop/AnnotatedAuroraAdmin.java
>  9243c92b11040b68ed6014b3991db69fc08bcddf 
>   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java 
> f8357c46df1b025bf4e38a7ce1cb1c13a50c39f9 
>   
> src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
>  364c5c753f884a2d89e27802d7bbf3b2b6d3a08e 
>   src/main/python/apache/aurora/client/api/__init__.py 
> 68baf8fdb90cd26100159401c46c9963c24332b3 
>   src/main/python/apache/aurora/client/cli/update.py 
> bb526f7bf94d7bfe02fe2786493c85be1bfeb86f 
>   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
> e157c0dfde5efc418448e138aa008ade742fe816 
>   src/test/python/apache/aurora/client/api/test_scheduler_client.py 
> afbd385b7eda64cb1f7d118b695e65e4045eac6c 
> 
> Diff: https://reviews.apache.org/r/50168/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Igor Morozov
> 
>

Reply via email to