> On Sept. 22, 2016, 2:25 p.m., Zameer Manji wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, line 152
> > <https://reviews.apache.org/r/52141/diff/1/?file=1507901#file1507901line152>
> >
> >     Something is wrong here.
> >     
> >     The listener from `GuavaUtils.LifecycleShutdownListener` should have 
> > already been applied to this service in `SchedulerServicesModule`. We used 
> > `addSchedulerActiveServiceBinding` to add the BatchWorker services so I'm 
> > not sure why this helps at all.

Ah, that's right, that should be enough. So, the fix is even simpler, just let 
the error bubble up and the existing shutdown hook will take care of the rest. 
Verified in vagrant that's exactly what happens.


- Maxim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review150000
-----------------------------------------------------------


On Sept. 22, 2016, 5:38 a.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> -----------------------------------------------------------
> 
> (Updated Sept. 22, 2016, 5:38 a.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
>     https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> ######Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
>     storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
>     for (IScheduledTask scheduledTask : scheduledTasks) {
>       updateTaskAndExternalState(
>           storeProvider.getUnsafeTaskStore(),
>           Tasks.id(scheduledTask),
>           Optional.of(scheduledTask),
>           Optional.of(PENDING),
>           Optional.absent());
>     }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ######Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
> 2ec3967ddb1d470cf681de062a6400f647978185 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
> 7c8047a7235a937c29fe96517242923ff84a080c 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
> d390c07522d22e43d79ce4370985f3643ef021ca 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
>   src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java 
> 8556253fc11f6027316871eb9dc66d8627a77fe6 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> -------
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>

Reply via email to