----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/52141/#review150035 -----------------------------------------------------------
Ship it! src/main/java/org/apache/aurora/scheduler/BatchWorker.java (line 195) <https://reviews.apache.org/r/52141/#comment217850> For posterity, This is important because the Guava docs says that `run()` should respond to shutdown requests by checking `isRunning()`. If we block indefinately we technically don't respond to shutdown requests. - Zameer Manji On Sept. 22, 2016, 8:20 a.m., Maxim Khutornenko wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/52141/ > ----------------------------------------------------------- > > (Updated Sept. 22, 2016, 8:20 a.m.) > > > Review request for Aurora, Stephan Erb and Zameer Manji. > > > Bugs: AURORA-1779 > https://issues.apache.org/jira/browse/AURORA-1779 > > > Repository: aurora > > > Description > ------- > > ######Problem > > The current `BatchWorker` error handling assumes graceful recovery if any of > the individual batch items fail. This may not hold true if the storage is > modified _before_ the error is thrown and the `LogStorage` transaction is not > rolled back. Consider the following fragment from the `StateManagerImpl`: > ``` > storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks); > > for (IScheduledTask scheduledTask : scheduledTasks) { > updateTaskAndExternalState( > storeProvider.getUnsafeTaskStore(), > Tasks.id(scheduledTask), > Optional.of(scheduledTask), > Optional.of(PENDING), > Optional.absent()); > } > ``` > If a task is saved but the subsequent `updateTaskAndExternalState()` throws > AND `BatchWorker` catches and logs that exception, the native store > transaction will be committed and as a result the storage will be left in a > logically inconsistent state. > > ######Solution > > I can see at least 3 ways to solve this problem: > 1. Fail hard and shutdown scheduler when any of the batch items throw; > 2. Fail only the last batch (drop all its items) and let storage transaction > roll back; > 3. Implement `BatchWorker` transaction where items _before_ and _after_ the > failed item would be retried when the storage transaction rolls back; > > The #3 is the most involved and would be very hard to get right (assuming > batch items are idempotent, which may not be the case). The #2 may result in > a very hard to troubleshoot scenario where _some_ items would be dropped on > the floor and never completed. > > I suggest we go with #1 as the most straightforward and transparent approach. > It's also the only one that ensures storage state consistency, which is the > most vital invariant to preserve (as AURORA-1779 proves). The only downside > of this approach is that scheduler will go down hard any time an unhandled > error is thrown but arguably this is the easiest way to improve our codebase > and certainly better than leaving the scheduler in a crippled state. > > > Diffs > ----- > > src/main/java/org/apache/aurora/scheduler/BatchWorker.java > e05d4b48749b0691902a505bea1b4490fdd08f29 > src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java > a86dc82cafa7f5436d2b8d00c6db575ff8311eea > > Diff: https://reviews.apache.org/r/52141/diff/ > > > Testing > ------- > > unit and e2e tests > > > Thanks, > > Maxim Khutornenko > >