> On Sept. 22, 2016, 2:24 p.m., Joshua Cohen wrote:
> > What's the behavior pre-BatchWorker in this case? Would we fail hard, or is 
> > there something inherent to the batch worker that makes this necessary?
> > 
> > Also, this fixes the error handling that led to corruption in the event of 
> > a failure, but should we also determine the root cause that led to the 
> > failure to begin with (as part of another review no doubt)?

The pre-BatchWorker behavior depended on where/how the error was thrown and 
could have resulted in a hard shutdown or a mysterious silent async failure 
killing one of the executor threads. With the BatchWorker, the scheduler will 
always fail on unhandled exception. This is needed to preserve the transaction 
boundaries and let the error rollback the changes.

> ...but should we also determine the root cause that led to the failure to 
> begin with  

I thought it should have been clear from the RB description but then I realized 
I explained it in Slack and not here. We purposefully induce scheduler 
failovers many times during the e2e run to verify things like job updates still 
work. If the shutdown happens during the transaction, the DB gets closed 
underneath and the subsequent queries fail any way they like. This is exactly 
what happened in this case. A task is saved, driver exits, StateManager 
attempts to save task events but fails due to "DB closed" error. That error is 
caught by the BatchWorker and the native log transaction successfully completes 
before the scheduler fully terminates.


- Maxim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review149999
-----------------------------------------------------------


On Sept. 22, 2016, 5:38 a.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> -----------------------------------------------------------
> 
> (Updated Sept. 22, 2016, 5:38 a.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
>     https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> ######Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
>     storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
>     for (IScheduledTask scheduledTask : scheduledTasks) {
>       updateTaskAndExternalState(
>           storeProvider.getUnsafeTaskStore(),
>           Tasks.id(scheduledTask),
>           Optional.of(scheduledTask),
>           Optional.of(PENDING),
>           Optional.absent());
>     }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ######Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
> 2ec3967ddb1d470cf681de062a6400f647978185 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
> 7c8047a7235a937c29fe96517242923ff84a080c 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
> d390c07522d22e43d79ce4370985f3643ef021ca 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
>   src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java 
> 8556253fc11f6027316871eb9dc66d8627a77fe6 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> -------
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>

Reply via email to