> On Nov. 16, 2017, 10:01 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java
> > Lines 209 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893880#file1893880line209>
> >
> >     I suggest keeping this as a valid transition and allowing the 
> > PartitionManager to deal with it.

See the comment I added to this transition. The TL;DR is it could indefinitely 
block restarts, maintenance and preemption when reschedule=False. I don't think 
this is desirable. Today any such indefinite delays are handled by transient 
task timeouts, but obviously we can't apply those timeouts to PARTITIONED. 

My conclusion after much deliberation (I mention some of this about duplicate 
instances in the design doc) was that it's fine just to move to LOST (which is 
the behavior today). The reasoning I ended up with is that PartitionPolicy is 
intended for keeping tasks *running*, and that partitions encountered when 
tearing tasks down are far less important (or not important at all). 

At the same time, it's a discussion worth fleshing out. The other alternative I 
had considered was to simply ignore PARTITIONED when restarting and wait for 
the transient task timeout to hit. But because we'd only want to do that on 
reschedule=False, this logic would have to live in PartitionManager. But this 
would not solve the problem of actually waiting indefinitely when 
reschedule=False.


> On Nov. 16, 2017, 10:01 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java
> > Lines 281 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893880#file1893880line281>
> >
> >     If `FAILED` is in this list, i would expect `FINISHED` as well.

Not sure how I missed that. Thanks!


> On Nov. 16, 2017, 10:01 p.m., Bill Farner wrote:
> > src/main/python/apache/aurora/config/schema/base.py
> > Lines 157 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893882#file1893882line157>
> >
> >     I would prefer the name `RescheduleImmediately` (or similar) to make 
> > this self-documenting.

See my reply to Jordan. Disable does not mean this, it's an alias for 
reschedule=False. I lifted the name Disabled straight from the comment in the 
proposal doc, but looks like it's added confusion. I'm thinking of just 
reverting that suggestion and going back to explicit PartitionPolicy.


> On Nov. 16, 2017, 10:01 p.m., Bill Farner wrote:
> > src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java
> > Lines 55 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893887#file1893887line55>
> >
> >     (disclaimer: see my longer comment about an integration test before 
> > trying to satisfy other comments in this file)
> >     
> >     Have a look at `FakeScheduledExecutor` and see if you find it helpful.  
> > The API is a bit weird, but it aims to simplify this type of test where you 
> > need a `Clock` and a `ScheduledExecutor`.  At the very least, it eliminates 
> > the noise of using `Capture`s.
> 
> David McLaughlin wrote:
>     I actually found the capture much easier to read and reason about than 
> the FakeScheduledExecutor, and I thought it led to some pretty clean tests. 
> Curious how others feel.

I had seen the FakeScheduledExecutor and opted not to use it. I actually find 
the capture much easier to read and reason about. Curious how others feel.


> On Nov. 16, 2017, 10:01 p.m., Bill Farner wrote:
> > src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java
> > Lines 55 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893887#file1893887line55>
> >
> >     (disclaimer: see my longer comment about an integration test before 
> > trying to satisfy other comments in this file)
> >     
> >     Have a look at `FakeScheduledExecutor` and see if you find it helpful.  
> > The API is a bit weird, but it aims to simplify this type of test where you 
> > need a `Clock` and a `ScheduledExecutor`.  At the very least, it eliminates 
> > the noise of using `Capture`s.
> 
> David McLaughlin wrote:
>     I had seen the FakeScheduledExecutor and opted not to use it. I actually 
> find the capture much easier to read and reason about. Curious how others 
> feel.

I actually found the capture much easier to read and reason about than the 
FakeScheduledExecutor, and I thought it led to some pretty clean tests. Curious 
how others feel.


> On Nov. 16, 2017, 10:01 p.m., Bill Farner wrote:
> > src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java
> > Lines 109-115 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893887#file1893887line109>
> >
> >     Test cases like this lead me to believe that an integration test is 
> > more appropriate.  This test case reads as "nothing happens when a task 
> > with reschedule=false transitions from RUNNING to PARTITIONED".  What we 
> > really mean, though, is that "a task with reschedule=false immediately 
> > moves to LOST upon transition attempt from RUNNING to PARTITIONED".
> >     
> >     With this in mind, i suggest you eliminate this test and move the 
> > coverage into `StateManagerImplTest`.  I think this is rather natural, 
> > since they work in unison to manage state transitions.
> >     
> >     I don't feel strongly about this, since it is not the convention in the 
> > codebase, and i don't know what hurdles you may encounter if you try to 
> > bundle this behavior in `StateManagerImplTest`.

> What we really mean, though, is that "a task with reschedule=false 
> immediately moves to LOST upon transition attempt from RUNNING to 
> PARTITIONED".

reschedule=False means that a transition to LOST is never called, which I think 
this test captures well. 

I thought the fact that I ended up having to test the TaskStateMachine in 
StateManagerImpl was a huge code smell :/


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review191229
-----------------------------------------------------------


On Nov. 16, 2017, 10:30 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 16, 2017, 10:30 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill 
> Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is 
> a proposal document to accompany this here: 
> https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, 
> metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md d653b797533735fc936a1cff2e005813757e1290 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> 1d369263d779b549b9304018437f535ddc855966 
>   examples/vagrant/upstart/aurora-scheduler.conf 
> 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 
> 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 
> 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 
> 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   
> src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java
>  5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java 
> PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java 
> b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 
> 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java 
> c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java 
> eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py 
> b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 
> 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py 
> bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 
> 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java 
> c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   
> src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java
>  7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java 
> PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 
> 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 
> 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 
> 186cb2737ba8e169819b7d54f86a7344a669b6cb 
>   src/test/python/apache/aurora/config/test_thrift.py 
> 7a1567a9b67917072bb0ba3eea5857e968375f4d 
>   src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 
> f0819fb7ac758ad1229a76fd9794b393400e9f63 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/7/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With 
> three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED 
> directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a 
> while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED 
> indefinitely)
> 
> I also verified tasks are able to transition to various states (back to 
> RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   
> https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   
> https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   
> https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>

Reply via email to