----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63536/#review190607 -----------------------------------------------------------
Overall approach appears sound. LGTM to proceed to tests. api/src/main/thrift/org/apache/aurora/gen/api.thrift Lines 526 (patched) <https://reviews.apache.org/r/63536/#comment268091> It seems logical that this accounting would live alongside `ScheduledTask.failureCount`. How would you feel about naming this `timesPartitioned` instead? I feel that more clearly disambiguates from overloaded meanings of `partition`, e.g. https://en.wikipedia.org/wiki/Partition_(database) src/main/java/org/apache/aurora/scheduler/base/Conversions.java Lines 71-72 (original), 71-72 (patched) <https://reviews.apache.org/r/63536/#comment268092> Comment is now stale. src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java Lines 100 (patched) <https://reviews.apache.org/r/63536/#comment268093> How about `-partition_aware` instead? I find the `enable` qualifier unnecessary. src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java Lines 31 (patched) <https://reviews.apache.org/r/63536/#comment268094> How about s/futureMap/delayedTransitions/ src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java Lines 40 (patched) <https://reviews.apache.org/r/63536/#comment268096> Please use `AsyncUtil.singleThreadLoggingScheduledExecutor()` instead. In addition to automatically logging and tracking exceptions, it will create a daemon thread. src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java Lines 48 (patched) <https://reviews.apache.org/r/63536/#comment268097> Consider using `Tasks.getLatestEvent(task).getTimestamp()` to hide the need for `Iterables.getLast()` src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java Lines 53-54 (patched) <https://reviews.apache.org/r/63536/#comment268100> `String taskId = Tasks.id(stateChange.getTask());` src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java Lines 69 (patched) <https://reviews.apache.org/r/63536/#comment268103> Readability nit: ```java if (!stateChange.isTransition()) { // This is a storage recovery event, so the calculated time // may be in the past. delay = Math.max(...); } ``` src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java Lines 76 (patched) <https://reviews.apache.org/r/63536/#comment268109> Thinking out loud - the `Optional.of(PARTITIONED)` parameter is already helping you perform a CAS. If we had a _true_ CAS operation, you could very safely avoid the need to cancel futures. You can't do that with the status only since the task can now cycle with `PARTITIONED`. What do you think about using the last transition timestamp to do this? ```java long casLastTransitionMs = Tasks.getLatestEvent(task).getTimestamp(); executor.schedule(() -> { storage.write(... { if (casLastTransition == Tasks.getLatestEvent(task).getTimestamp()) { // proceed } else { // cas failed, the task has since transitioned } } }, delay, TimeUnit.SECONDS); ``` src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java Lines 294 (patched) <https://reviews.apache.org/r/63536/#comment268110> `==` instead of `.equals()` src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java Lines 297 (patched) <https://reviews.apache.org/r/63536/#comment268111> This is unnecessary. src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java Lines 390 (patched) <https://reviews.apache.org/r/63536/#comment268113> s/The goal here is to c/C/ src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java Lines 400 (patched) <https://reviews.apache.org/r/63536/#comment268114> It's worth mentioning that this is done to bound the number of events stored for a task. src/main/java/org/apache/aurora/scheduler/state/StateModule.java Lines 70 (patched) <https://reviews.apache.org/r/63536/#comment268115> PubsubEventModule.bindSubscriber(binder, PartitionManager.class); - Bill Farner On Nov. 8, 2017, 4:48 p.m., David McLaughlin wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/63536/ > ----------------------------------------------------------- > > (Updated Nov. 8, 2017, 4:48 p.m.) > > > Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill > Farner. > > > Repository: aurora > > > Description > ------- > > This is my prototype code for adding partition-awareness to Aurora. There is > a proposal document to accompany this here: > https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit# > > I'd like feedback on the high-level approach before adding unit tests, > metrics, logging, etc. > > > Diffs > ----- > > api/src/main/thrift/org/apache/aurora/gen/api.thrift > c869493c06499340d73e1b219e17a0d7d8b5ead9 > examples/vagrant/upstart/aurora-scheduler.conf > 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 > src/main/java/org/apache/aurora/scheduler/base/Conversions.java > 33cc012a2cad929b0dd1ce236597b870cfc5aba0 > > src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java > 5e83b2acdde7198d16427d4031e9772f78612554 > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java > PRE-CREATION > src/main/java/org/apache/aurora/scheduler/state/SideEffect.java > b91a0852e968b4aa9d74801601671cb61af3648a > src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java > 9989ed441cd9bc442e6472768880ce7924c3bdd9 > src/main/java/org/apache/aurora/scheduler/state/StateModule.java > d72f055749801ee9d6f31f60857cc795d0ed7ab1 > src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java > eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 > src/main/python/apache/aurora/client/cli/jobs.py > b79ae56bee0e5692cacf1e66f4a4126b06aaffdc > src/main/python/apache/aurora/config/schema/base.py > 18ce826363009e1cc0beac5cce4edf42610d487e > src/main/python/apache/aurora/config/thrift.py > bedf8cd6641e1b1a930602791b758d584af4891c > > > Diff: https://reviews.apache.org/r/63536/diff/2/ > > > Testing > ------- > > Manual testing in Vagrant by stopping and starting the Mesos agent. With > three jobs: > > 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED > directly to LOST) > 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a > while before moving to LOST) > 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED > indefinitely) > > I also verified tasks are able to transition to various states (back to > RUNNING or moving to FAILED, etc.) when you turn the agent back on. > > > File Attachments > ---------------- > > Task in PARTITIONED state > > https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png > Task back into running when partition resolved > > https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png > Compaction of PARTITIONED cycles (note timestamps) > > https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png > > > Thanks, > > David McLaughlin > >