> On Dec. 11, 2015, 7:07 p.m., Bill Farner wrote:
> > Can you take a stab at illustrating in the description how this problem 
> > manifests?  The current shape of the patch feels like it papers over a 
> > problem that we might be left to solve again, so thinking about the problem 
> > at a high level might help us think about a broader fix.

I agree it may feel somewhat hacky but it's the easiest way to handle 
updater/event race without adding more moving parts or excessive locking.

The core of the problem is (as mentioned in the description) in the way we 
process task events. After moving into async processing, events can be (and 
are) delayed. It's a hard problem to follow but I'll try to explain with an 
example of AURORA-1506:
- instance 1 fails while been watched by the updater, event F is dispatched 
(e.g.: RUNNING->FAILED)
- instance 1 is rescheduled and event R is dispatched (e.g.: INIT->THROTTLED)
- event F reached updater and updater decides to rollback, issues a kill for 
the instance intending to return to previous config
- THROTTLED task is deleted from the store without firing a task change event 
(but firing a task deleted event)
- event R is finally processed by async bus and updater is notified. Updater 
follows the evaluate path and "thinks" a KILL is still required, which is legit 
but does not make sense as the task is no longer in the store.

So, the crux of the problem is entering updater evaluator loop with a stale 
task info. Figuring out if the task info is stale at the subscriber point is 
impossible as it requires updater inner state. The best, as it seems to me, let 
the updater figure out what to do but prevent the state changing action if it 
conflicts with the store. Sort of optimistic evaluation run.

Now to the question: The only downside of this race is a stack trace in the 
log. The actual errors are handled by the `JobUpdateEventSubscriber` and do not 
fail the update. These errors are polluting our logs and monitoring stats.


- Maxim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41226/#review109984
-----------------------------------------------------------


On Dec. 11, 2015, 1:33 a.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/41226/
> -----------------------------------------------------------
> 
> (Updated Dec. 11, 2015, 1:33 a.m.)
> 
> 
> Review request for Aurora and Bill Farner.
> 
> 
> Bugs: AURORA-1506 and AURORA-1507
>     https://issues.apache.org/jira/browse/AURORA-1506
>     https://issues.apache.org/jira/browse/AURORA-1507
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task event processing delays may generate duplicate evaluation results in the 
> `InstanceActionHandler`. This leads to occasional stack traces in the log but 
> is generally benign as `JobUpdateEventSubscriber` catches these errors. 
> 
> The easiest fix is to check for the task existence before attempting to 
> kill/add instance.
> 
> 
> Diffs
> -----
> 
>   
> src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java 
> d8686f1f1b53e5ff2791663489e24c342503831e 
>   src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java 
> 0583a63a175880355a7296ebd1c6e6fb5dc99f38 
>   src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java 
> PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/41226/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>

Reply via email to