> On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: > > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, > > line 273 > > <https://reviews.apache.org/r/30225/diff/4/?file=848246#file848246line273> > > > > I think avoid acquisition of a write lock here is a good goal to aim > > for. If, for example, we are already holding the write lock for a large > > snapshot, we could cause it to appear as though we have not received a > > pulse for an extended duration. > > > > Also, given that there is likely to be an automated system on the other > > end, we should be somewhat defensive to a high call frequency. This > > suggests to me that potentially-expensive read operations should be avoided > > if possible. > > > > For these reasons, i think it would be wise to decouple recording of > > pulses from reacting to them. It's also apparent that there is currently > > no asynchronous transition to the 'blocked' states, they're triggered > > either by a tardy pulse or another external action. I think this would be > > surprising behavior, and the scheduler should automatically transition to > > these states without any external input. > > Maxim Khutornenko wrote: > "I think avoid acquisition of a write lock here is a good goal to aim > for. If, for example, we are already holding the write lock for a large > snapshot, we could cause it to appear as though we have not received a pulse > for an extended duration." - makes sense, I can try to split the recording > part from the status update via an async action. > > "Also, given that there is likely to be an automated system on the other > end, we should be somewhat defensive to a high call frequency. This suggests > to me that potentially-expensive read operations should be avoided if > possible." - I don't see how this would be possible without breaking the RPC > contract as I still have to pull JobUpdateDetails in order to respond to the > pulse with proper status message (FINISHED|PAUSED|OK). > > > "It's also apparent that there is currently no asynchronous transition to > the 'blocked' states, they're triggered either by a tardy pulse or another > external action. I think this would be surprising behavior, and the > scheduler should automatically transition to these states without any > external input." - the transition to BLOCKED state happens during the > evaluation loop similar to other internally triggered state transitions > (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states > do not happen instantenously and are triggered by instance activities (i.e. > task state change events). I don't see why BLOCKED state should be any > different. The update is not technically alive until the evaluation event > arrives. Changing this behavior just for heartbeats is not worth the extra > complexity and will be inconsistent with the rest of the feature.
> I don't see how this would be possible without breaking the RPC contract as I > still have to pull JobUpdateDetails in order to respond to the pulse with > proper status message (FINISHED|PAUSED|OK) Yeah, this crossed my mind. Perhaps we should poke at the API to see if we can shake those requirements out? For example, we could just return OK/UPDATE_NOT_ACTIVE, and the caller should follow up with a query to differentiate between a paused and finished update? I don't feel really strongly about this, but i think we should feel free to question the requirements at this phase. > the transition to BLOCKED state happens during the evaluation loop similar to > other internally triggered state transitions > (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states > do not happen instantenously and are triggered by instance activities (i.e. > task state change events). I don't see why BLOCKED state should be any > different. The update is not technically alive until the evaluation event > arrives. Changing this behavior just for heartbeats is not worth the extra > complexity and will be inconsistent with the rest of the feature Transitions do indeed happen asynchronously when dealing with timeouts (relevant code is in `InstanceActionHandler`). These fall back to the same loop, but give the updater a time after which the state should be re-evaluated. - Bill ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review71063 ----------------------------------------------------------- On Feb. 4, 2015, 5:24 p.m., Maxim Khutornenko wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/30225/ > ----------------------------------------------------------- > > (Updated Feb. 4, 2015, 5:24 p.m.) > > > Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. > > > Bugs: AURORA-1010 > https://issues.apache.org/jira/browse/AURORA-1010 > > > Repository: aurora > > > Description > ------- > > Added pulsing support into the JobUpdateController. The qualified coordinated > updates get blocked until a pulse arrives. An update then becomes active and > proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a > terminal state (whichever comes first). > > Not particularly happy with plumbing through OneWayJobUpdater but the > alternative is a state machine change, which is much hairier and will require > more changes in the JobUpdaterController. Going with the minimal diff here. > > > Diffs > ----- > > api/src/main/thrift/org/apache/aurora/gen/api.thrift > 4f603f9e7ed004e53937a45ea8edf7241e15f5cf > src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java > 7f2ec71d744d5fafac84291cc79feba3398d0e1e > src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java > b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 > src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java > d479d20259f284528b2291e2e486b36d8e47fe5e > > src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java > 60f535988a20638fb16515d5027bfa65f1afb73c > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java > d3b30d48b76d8d7c64cda006a34f7ed3296526f2 > > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java > ec9d1e07abca1bf30262e1c0f741a34741e96f5d > > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java > 76460f95e058181b711fb6869f5a34c1d5bdfe31 > src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java > 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 > > src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml > f9c9ceddc559b43b4a5c45c745d54ff47484edde > > src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java > ca7c0c2675477cc727ca006697665f997972dfde > > src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java > 89765ac3d34a827d3748fb96a275d78e9d1b8b72 > src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java > 41d422939209d0808235128e4242c11e8ef25969 > src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java > 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 > > Diff: https://reviews.apache.org/r/30225/diff/ > > > Testing > ------- > > ./gradlew -Pq build > > > Thanks, > > Maxim Khutornenko > >
