This is an automatically generated e-mail. To reply, visit:

Finally had time to go over this review in detail. I am afraid the proposed 
approach of rolling back the arbitrary update may not work in general case and 
result in unexpected outcome, inconsistent job state and hard to troubleshoot 
update failures. Once the global update lock is lifted (e.g. the update 
finishes), subsequent updates and/or kill/add instance RPC calls will drift the 
job state making the history obsolete. 

For example, consider updates changing instance counts:
U1: 5 -> 10
U2: 10 -> 20 (or addInstances(0, 10))

If we now attempt to rollback U1 we will end up with a hole [5-9] in the 
instance count: 0-4, 10-19 AND likely unexpected extra 10 instances left 

Following the above example we can come up with plenty of other intricate and 
hard to reason scenarios. By allowing randomly old update rollback we are 
giving our users a powerful tool that is very hard to understand or ensure 

The only way this _may_ be improved if we approach rollbackJobUpdate RPC as the 
one creating a completely new update request with the old initial state used as 
the new desired. This way _all_ current job instances are being considered, job 
diff gives correct output and quota checks are properly done. The latter btw is 
another problem as rollbackJobUpdate does not call `validateTaskLimits` to 
ensure there is enough quota to rollback something like 10 -> 5. The open 
problem: in case of multiple initial states to rollback to, which one to choose 
as the new desired one? We can only choose one according to our thrift schema 
(for a good reason). 

We have thought long and hard about the rollback cases in our internal 
deployment orchestration tool and ended up with the approach that operates at 
the .aurora config (JobUpdateRequest) level by creating a _new_ update with the 
target job config (JobUpdateRequest) of the _old_ one. This way we can safely 
rollback to the previously deployed version without violating job/cluster 

Given the concerns in this CR, I suggest we move this discussion to dev list or 
a design doc.

To seed our further discussion: we can consider an approach where instead of 
rolling back a job update itself we rollback _to_ that update's target state. 
For example (desired state === desiredState + settings + instanceCount):
U1: desired state A
U2: desired state B
U3: desired state C

To rollback to state B: U4: use desired state of U2
To rollback to state A: U4: use desired state of U1

While not ideal (e.g. there is no way to rollback to state A-1), this would be 
safer and easier to reason about. It'd also require a smaller diff as most of 
the required code paths already exist.

- Maxim Khutornenko

On July 28, 2016, 10:54 p.m., Igor Morozov wrote:
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50168/
> -----------------------------------------------------------
> (Updated July 28, 2016, 10:54 p.m.)
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, Bill Farner, and 
> Zameer Manji.
> Bugs: AURORA-1721
>     https://issues.apache.org/jira/browse/AURORA-1721
> Repository: aurora
> Description
> -------
> Add rollback functionality to the scheduler
> Thic change also includes an addition to storage replicated log. I needed a 
> way to re-insert lock token for an existing job update.
> Diffs
> -----
>   CHANGELOG fc6a46d77ebf889c8f60402c95e8a7472980ba1c 
>   RELEASE-NOTES.md 29d224d5596eb060356f1343cf347db79b1ad687 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> 1d66208490aff6ea8af4c737845fa2cf13617529 
>   api/src/main/thrift/org/apache/aurora/gen/storage.thrift 
> 9e4213f13255a182df938bea44ca87fa03a25318 
>   src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java 
> 52c3c6618a3cf1009435ca8a9cece36365913e55 
>   src/main/java/org/apache/aurora/scheduler/storage/db/DbJobUpdateStore.java 
> d2673e6b328cb1e249fbe91d18e0d9e935636eaa 
>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java 
> 39924c62108f6a68f545f90d77ceab4265d41fad 
> src/main/java/org/apache/aurora/scheduler/storage/log/WriteAheadStorage.java 
> d0de063fd78e6c4f62fae4a598d1d22f9775772d 
> src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java
>  b534abf95bab6e1657e3ef993cf34c0d6ec460be 
> src/main/java/org/apache/aurora/scheduler/thrift/aop/AnnotatedAuroraAdmin.java
>  9243c92b11040b68ed6014b3991db69fc08bcddf 
>   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java 
> f8357c46df1b025bf4e38a7ce1cb1c13a50c39f9 
> src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
>  364c5c753f884a2d89e27802d7bbf3b2b6d3a08e 
>   src/main/python/apache/aurora/client/api/__init__.py 
> 68baf8fdb90cd26100159401c46c9963c24332b3 
>   src/main/python/apache/aurora/client/cli/update.py 
> bb526f7bf94d7bfe02fe2786493c85be1bfeb86f 
>   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
> e157c0dfde5efc418448e138aa008ade742fe816 
>   src/test/python/apache/aurora/client/api/test_scheduler_client.py 
> afbd385b7eda64cb1f7d118b695e65e4045eac6c 
> Diff: https://reviews.apache.org/r/50168/diff/
> Testing
> -------
> Thanks,
> Igor Morozov

Reply via email to