> On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: > > api/src/main/thrift/org/apache/aurora/gen/api.thrift, line 565 > > <https://reviews.apache.org/r/30225/diff/4/?file=848240#file848240line565> > > > > Maybe s/BLOCKED/AWAITING_PULSE/? > > > > That would at least self-document and avoid additional terminology.
Renamed. > On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: > > src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java, line > > 49 > > <https://reviews.apache.org/r/30225/diff/4/?file=848242#file848242line49> > > > > s/query/fetch/ to be consistent with the method above. I started with "fetch" but that resulted in an overloaded method that I did not quite liked. Changed it back. > On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: > > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, > > line 106 > > <https://reviews.apache.org/r/30225/diff/4/?file=848246#file848246line106> > > > > Remove Optional here, do Optional.of() at the call site. Removed. > On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: > > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, > > line 122 > > <https://reviews.apache.org/r/30225/diff/4/?file=848246#file848246line122> > > > > TODO + ticket, this map needs to be exposed via a debug endpoint Done. > On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: > > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, > > line 273 > > <https://reviews.apache.org/r/30225/diff/4/?file=848246#file848246line273> > > > > I think avoid acquisition of a write lock here is a good goal to aim > > for. If, for example, we are already holding the write lock for a large > > snapshot, we could cause it to appear as though we have not received a > > pulse for an extended duration. > > > > Also, given that there is likely to be an automated system on the other > > end, we should be somewhat defensive to a high call frequency. This > > suggests to me that potentially-expensive read operations should be avoided > > if possible. > > > > For these reasons, i think it would be wise to decouple recording of > > pulses from reacting to them. It's also apparent that there is currently > > no asynchronous transition to the 'blocked' states, they're triggered > > either by a tardy pulse or another external action. I think this would be > > surprising behavior, and the scheduler should automatically transition to > > these states without any external input. > > Maxim Khutornenko wrote: > "I think avoid acquisition of a write lock here is a good goal to aim > for. If, for example, we are already holding the write lock for a large > snapshot, we could cause it to appear as though we have not received a pulse > for an extended duration." - makes sense, I can try to split the recording > part from the status update via an async action. > > "Also, given that there is likely to be an automated system on the other > end, we should be somewhat defensive to a high call frequency. This suggests > to me that potentially-expensive read operations should be avoided if > possible." - I don't see how this would be possible without breaking the RPC > contract as I still have to pull JobUpdateDetails in order to respond to the > pulse with proper status message (FINISHED|PAUSED|OK). > > > "It's also apparent that there is currently no asynchronous transition to > the 'blocked' states, they're triggered either by a tardy pulse or another > external action. I think this would be surprising behavior, and the > scheduler should automatically transition to these states without any > external input." - the transition to BLOCKED state happens during the > evaluation loop similar to other internally triggered state transitions > (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states > do not happen instantenously and are triggered by instance activities (i.e. > task state change events). I don't see why BLOCKED state should be any > different. The update is not technically alive until the evaluation event > arrives. Changing this behavior just for heartbeats is not worth the extra > complexity and will be inconsistent with the rest of the feature. > > Bill Farner wrote: > > I don't see how this would be possible without breaking the RPC > contract as I still have to pull JobUpdateDetails in order to respond to the > pulse with proper status message (FINISHED|PAUSED|OK) > > Yeah, this crossed my mind. Perhaps we should poke at the API to see if > we can shake those requirements out? For example, we could just return > OK/UPDATE_NOT_ACTIVE, and the caller should follow up with a query to > differentiate between a paused and finished update? > > I don't feel really strongly about this, but i think we should feel free > to question the requirements at this phase. > > > the transition to BLOCKED state happens during the evaluation loop > similar to other internally triggered state transitions > (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states > do not happen instantenously and are triggered by instance activities (i.e. > task state change events). I don't see why BLOCKED state should be any > different. The update is not technically alive until the evaluation event > arrives. Changing this behavior just for heartbeats is not worth the extra > complexity and will be inconsistent with the rest of the feature > > Transitions do indeed happen asynchronously when dealing with timeouts > (relevant code is in `InstanceActionHandler`). These fall back to the same > loop, but give the updater a time after which the state should be > re-evaluated. > > Maxim Khutornenko wrote: > >Yeah, this crossed my mind. Perhaps we should poke at the API to see if > we can shake those requirements out? For example, we could just return > OK/UPDATE_NOT_ACTIVE, and the caller should follow up with a query to > differentiate between a paused and finished update? I don't feel really > strongly about this, but i think we should feel free to question the > requirements at this phase. > > I don't think this would buy us much if external service starts following > up each pulse that returns UPDATE_NOT_ACTIVE. It would just makes the > contract harder to consume. If we are concerned about obuse I think having a > rate limiter would make more sense here. Also, I still need to pull > instructions to determine if the pulse is legit (i.e. the update is actually > a coordinated update). Otherwise, pulse goes nowhere and external service > continues to pulse the wrong update. > > Bill Farner wrote: > > Also, I still need to pull instructions to determine if the pulse is > legit (i.e. the update is actually a coordinated update). Otherwise, pulse > goes nowhere and external service continues to pulse the wrong update. > > That's not _necessarily_ required, you could contain it in a `Runnable` > implementation that is responsible for the async transition. > > Maxim Khutornenko wrote: > Not sure I get it. If it's in async part that means the pulse response is > already gone and the external service is oblivious what it actually pulsed. > > David McLaughlin wrote: > For me the logic for pulse is fine, it's just the acquisition of the > write lock that is a little off. It should be this: > > pulse(JobKey key, String updateId) { > // fetch the update status > // if the update is terminal > // return GO_AWAY > // if the update is non-terminal > // update the pulse > > // if the update is currently blocked > // acquire write lock > // change status > } > > > Give or take some operations. You could remove the fetching of the update > status, but then you're just asking the client to start polling the scheduler > every pulse instead to make the same decision. Moved to async action to avoid delaying pulse response. > On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: > > src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml, > > line 301 > > <https://reviews.apache.org/r/30225/diff/4/?file=848249#file848249line301> > > > > is it possible to reduce the diff a bit by moving this block to its > > former location, or is it needed This is diff playing tricks. I moved the filter code out to reuse it with details select. - Maxim ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review71063 ----------------------------------------------------------- On Feb. 4, 2015, 5:24 p.m., Maxim Khutornenko wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/30225/ > ----------------------------------------------------------- > > (Updated Feb. 4, 2015, 5:24 p.m.) > > > Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. > > > Bugs: AURORA-1010 > https://issues.apache.org/jira/browse/AURORA-1010 > > > Repository: aurora > > > Description > ------- > > Added pulsing support into the JobUpdateController. The qualified coordinated > updates get blocked until a pulse arrives. An update then becomes active and > proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a > terminal state (whichever comes first). > > Not particularly happy with plumbing through OneWayJobUpdater but the > alternative is a state machine change, which is much hairier and will require > more changes in the JobUpdaterController. Going with the minimal diff here. > > > Diffs > ----- > > api/src/main/thrift/org/apache/aurora/gen/api.thrift > 4f603f9e7ed004e53937a45ea8edf7241e15f5cf > src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java > 7f2ec71d744d5fafac84291cc79feba3398d0e1e > src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java > b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 > src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java > d479d20259f284528b2291e2e486b36d8e47fe5e > > src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java > 60f535988a20638fb16515d5027bfa65f1afb73c > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java > d3b30d48b76d8d7c64cda006a34f7ed3296526f2 > > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java > ec9d1e07abca1bf30262e1c0f741a34741e96f5d > > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java > 76460f95e058181b711fb6869f5a34c1d5bdfe31 > src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java > 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 > > src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml > f9c9ceddc559b43b4a5c45c745d54ff47484edde > > src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java > ca7c0c2675477cc727ca006697665f997972dfde > > src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java > 89765ac3d34a827d3748fb96a275d78e9d1b8b72 > src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java > 41d422939209d0808235128e4242c11e8ef25969 > src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java > 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 > > Diff: https://reviews.apache.org/r/30225/diff/ > > > Testing > ------- > > ./gradlew -Pq build > > > Thanks, > > Maxim Khutornenko > >
