> On July 17, 2018, 8:52 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
> > Lines 413-417 (patched)
> > <https://reviews.apache.org/r/67696/diff/9/?file=2059744#file2059744line414>
> >
> > I am a bit confused about this. Can you please elaborate a bit more
> > when/how you observed this problem?
> >
> > I am mostly concerned about race conditions here.
> >
> > For example, say I trigger a job update from config A to B. Shortly
> > before I update an instance the instance fails or transitions to LOST.
> > Concurrently we might end up hopping into the `instanceChanged` method
> > here, but bail due to the new guard without ever handling the switch from
> > config A to B.
> >
> > Could be that I am missing something entirely here. Would be great if
> > you could elaborate.
Good question.
When we determine what action we should perform on an instance, we use the
config/state we get from these task events. If we are using stale information,
we may be producing inaccurate actions to perform. This is a rare scenario but
it relies on tons of async events happening at the same time so there is a
large delay in event processing. The scenario that I encountered that lead me
to introduce this guard was the following:
1. Task "ABC-old-1" is being scheduled and goes from ASSIGNED -> RUNNING
(RUNNING is now on queue to be evaluated as an update)
2. A different previous task finishes updating so instance 1 is added to the
working group to be updated. The initial evaluation causes "ABC-old-1" to be
moved to KILLING.
3. "ABC-old-1" is KILLED and "ABC-new-1" is created and running.
4. RUNNING from the async queue is finally evaluated so we see that this
instance number needs to be updated (since it is "ABC-old-1") so we kill the
task at 1 (which is now "ABC-new-1").
5. Before this patch, this would mean the task needs to be killed and added
again. Now, since there is an async aspect to killing, this could mean an
instance # is permanently killed post-update.
I am ensuring this condition is met within the evaluation of state changes.
From `StateEvaluator`:
```
* It is the responsibility of the caller to ensure that the {@code
actualState} is the latest
* value. Note: the caller should avoid calling this when a terminal task is
moving to another
* terminal state. It should also suppress deletion events for tasks that
have been replaced by
* an active task.
```
In the scenario you provide, the condition will work as intended and the update
should still move to completion successfully. A write lock governs all actions
dealing with updates so we do not have race conditions.
1. Instance A needs to be updated.
2. Instance A fails before being evaluated for an update.
3. Since A failed, it will be rescheduled into A'.
- If A is evaluated for the update before being rescheduled (e.g. we evaluate
the state change into PENDING or FAILED before rescheduling), then we kill that
task and update the configuration
- If A is rescheduled before being evaluated for the update, then we will
kill A' and update the configuration
Lots of moving parts for updates so hopefully what I said makes sense :/
- Jordan
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67696/#review206149
-----------------------------------------------------------
On July 17, 2018, 6:21 p.m., Jordan Ly wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67696/
> -----------------------------------------------------------
>
> (Updated July 17, 2018, 6:21 p.m.)
>
>
> Review request for Aurora, David McLaughlin, Daniel Knightly, Renan DelValle,
> Santhosh Kumar Shanmugham, and Stephan Erb.
>
>
> Repository: aurora
>
>
> Description
> -------
>
> This patch enables SLA-aware updates.
>
> Following https://reviews.apache.org/r/66716/, tasks may now specify custom
> SLA policies that will be respected by the scheduler during maintenance. This
> patch integrates into the same system to allow users to specify if they want
> their updates to also respect SLA. Please see
> https://docs.google.com/document/d/1lCoDyoX26qrGrptrgO7vJHqYR_L2CBRGFIywsAd8uQo/edit?usp=sharing
> for a more detailed description.
>
> This patch adds two optional Thrift fields, `slaAware` to `JobUpdateSettings`
> and `message` to `JobInstanceUpdateEvent`. These should be forward and
> backwards compatible.
>
>
> Diffs
> -----
>
> RELEASE-NOTES.md edc081f502370190597ad028f3275cdfd572f5ca
> api/src/main/thrift/org/apache/aurora/gen/api.thrift
> 7265b11103aa12743c42355163ae64e98e965d7f
> docs/features/job-updates.md b52eb35a1de9da40f8a00ef0b905df30069029d3
> docs/reference/configuration.md acab4c58d9ab3c04d156fed3636e77aed6d1faf4
> docs/reference/scheduler-configuration.md
> 805e516689be019101f7c220c89fd9c391bb93b3
> src/main/java/org/apache/aurora/scheduler/base/Tasks.java
> 2e13aacf576e648d9fffe989e4fc05c8954e72d8
> src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java
> 9aa51c3637df74cca088bd65c5539e1ebb8e5f0d
>
> src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java
> 6a28bc274acdd6d3ac239166771ef2d45648d60f
>
> src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java
> 9fa68b2dd55b4e4f5436356c1b94af1393967679
> src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java
> a002d955c3bc7b7c39da5e130e8c10c536bdcebd
>
> src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
> ec577cccb86914ebd679ca235103f79dd7e7b79d
> src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java
> f2d33fb9ab6bd2c3ff199ab03dc75b1d6d618f3a
> src/main/java/org/apache/aurora/scheduler/updater/SlaKillController.java
> PRE-CREATION
> src/main/java/org/apache/aurora/scheduler/updater/UpdateFactory.java
> 3992aa77fc305adc390a4aaeb1d3939d6241ddbd
> src/main/java/org/apache/aurora/scheduler/updater/UpdaterModule.java
> 74ee1745e1fd7c308e2bdfa46aeb18a7ecfe14d2
> src/main/java/org/apache/aurora/scheduler/updater/Updates.java
> f949fd54f524780672167e12fcadf268da08e679
> src/main/python/apache/aurora/client/api/updater_util.py
> 4e3986220aaa4c9b138394b962120b176185af12
> src/main/python/apache/aurora/config/schema/base.py
> 7baded79acdf863670afc183d740dcad602490c2
> src/test/java/org/apache/aurora/scheduler/base/TaskTestUtil.java
> abee0951ca998894b29ee32c5362ef30da6421c7
> src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java
> dcf58896f1e866c0369ba1b78060236e98d9d46b
>
> src/test/java/org/apache/aurora/scheduler/storage/AbstractJobUpdateStoreTest.java
> 3a06a451da4ef3acccb33b5495b9fae141557148
>
> src/test/java/org/apache/aurora/scheduler/testing/FakeScheduledExecutor.java
> 0aea369d8a8f75291de9691b6d61f3d48895507c
>
> src/test/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterfaceTest.java
> aa1cb2b287642e87d787e160e04a17ad0e4690d9
> src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java
> 43f857d893a54e19e71b36f2f06fef3a3ef6e874
> src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java
> 5667a1b59681a6de87149d7161d760bff5da3818
> src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java
> 2c27ec7136ff22a3570a4ec278c73f7ee310f628
>
> src/test/java/org/apache/aurora/scheduler/updater/SlaKillControllerTest.java
> PRE-CREATION
> src/test/python/apache/aurora/client/cli/test_inspect.py
> 2baba2aa55865ec298a4c9e5af3952b56cb9a910
>
> src/test/resources/org/apache/aurora/scheduler/storage/durability/goldens/current/saveJobInstanceUpdateEvent
> 48902d39b9d2cbeae1a52180669aba8349e4dd65
>
> src/test/resources/org/apache/aurora/scheduler/storage/durability/goldens/current/saveJobUpdate
> 08dfa5b9a67989083a0d405ce8100698a4d096ae
> ui/src/main/js/components/UpdateInstanceEvents.js
> 8351f2c4c256e27625d94b70842be0e91065a551
> ui/src/main/js/components/UpdateSettings.js
> d756f5916fd8c39dcbb5578ee5eda198f807f458
>
>
> Diff: https://reviews.apache.org/r/67696/diff/10/
>
>
> Testing
> -------
>
> Added unit tests, `./gradlew test`.
>
> Tested at scale with over 10,000 SLA-aware instance update events occuring
> concurrently. Scheduler stability did not seem to be affected.
>
>
> Thanks,
>
> Jordan Ly
>
>