-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review149963
-----------------------------------------------------------


Ship it!




Ship It!

- Stephan Erb


On Sept. 22, 2016, 7:38 a.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> -----------------------------------------------------------
> 
> (Updated Sept. 22, 2016, 7:38 a.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
>     https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> ######Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
>     storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
>     for (IScheduledTask scheduledTask : scheduledTasks) {
>       updateTaskAndExternalState(
>           storeProvider.getUnsafeTaskStore(),
>           Tasks.id(scheduledTask),
>           Optional.of(scheduledTask),
>           Optional.of(PENDING),
>           Optional.absent());
>     }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ######Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
> 2ec3967ddb1d470cf681de062a6400f647978185 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
> 7c8047a7235a937c29fe96517242923ff84a080c 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
> d390c07522d22e43d79ce4370985f3643ef021ca 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
>   src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java 
> 8556253fc11f6027316871eb9dc66d8627a77fe6 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> -------
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>

Reply via email to