Re: Review Request 30225: Modifying update controller to support heartbeats.

Maxim Khutornenko Wed, 04 Feb 2015 18:35:13 -0800


> On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote:
> > api/src/main/thrift/org/apache/aurora/gen/api.thrift, line 565
> > <https://reviews.apache.org/r/30225/diff/4/?file=848240#file848240line565>
> >
> >     Maybe s/BLOCKED/AWAITING_PULSE/?
> >     
> >     That would at least self-document and avoid additional terminology.


Renamed.


> On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java, line 
> > 49
> > <https://reviews.apache.org/r/30225/diff/4/?file=848242#file848242line49>
> >
> >     s/query/fetch/ to be consistent with the method above.

I started with "fetch" but that resulted in an overloaded method that I did not 
quite liked. Changed it back.


> On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java,
> >  line 106
> > <https://reviews.apache.org/r/30225/diff/4/?file=848246#file848246line106>
> >
> >     Remove Optional here, do Optional.of() at the call site.

Removed.


> On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java,
> >  line 122
> > <https://reviews.apache.org/r/30225/diff/4/?file=848246#file848246line122>
> >
> >     TODO + ticket, this map needs to be exposed via a debug endpoint

Done.


> On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java,
> >  line 273
> > <https://reviews.apache.org/r/30225/diff/4/?file=848246#file848246line273>
> >
> >     I think avoid acquisition of a write lock here is a good goal to aim 
> > for.  If, for example, we are already holding the write lock for a large 
> > snapshot, we could cause it to appear as though we have not received a 
> > pulse for an extended duration.
> >     
> >     Also, given that there is likely to be an automated system on the other 
> > end, we should be somewhat defensive to a high call frequency.  This 
> > suggests to me that potentially-expensive read operations should be avoided 
> > if possible.
> >     
> >     For these reasons, i think it would be wise to decouple recording of 
> > pulses from reacting to them.  It's also apparent that there is currently 
> > no asynchronous transition to the 'blocked' states, they're triggered 
> > either by a tardy pulse or another external action.  I think this would be 
> > surprising behavior, and the scheduler should automatically transition to 
> > these states without any external input.
> 
> Maxim Khutornenko wrote:
>     "I think avoid acquisition of a write lock here is a good goal to aim 
> for.  If, for example, we are already holding the write lock for a large 
> snapshot, we could cause it to appear as though we have not received a pulse 
> for an extended duration." - makes sense, I can try to split the recording 
> part from the status update via an async action. 
>     
>     "Also, given that there is likely to be an automated system on the other 
> end, we should be somewhat defensive to a high call frequency.  This suggests 
> to me that potentially-expensive read operations should be avoided if 
> possible." - I don't see how this would be possible without breaking the RPC 
> contract as I still have to pull JobUpdateDetails in order to respond to the 
> pulse with proper status message (FINISHED|PAUSED|OK).
>     
>     
>     "It's also apparent that there is currently no asynchronous transition to 
> the 'blocked' states, they're triggered either by a tardy pulse or another 
> external action.  I think this would be surprising behavior, and the 
> scheduler should automatically transition to these states without any 
> external input." - the transition to BLOCKED state happens during the 
> evaluation loop similar to other internally triggered state transitions 
> (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states 
> do not happen instantenously and are triggered by instance activities (i.e. 
> task state change events). I don't see why BLOCKED state should be any 
> different. The update is not technically alive until the evaluation event 
> arrives. Changing this behavior just for heartbeats is not worth the extra 
> complexity and will be inconsistent with the rest of the feature.
> 
> Bill Farner wrote:
>     > I don't see how this would be possible without breaking the RPC 
> contract as I still have to pull JobUpdateDetails in order to respond to the 
> pulse with proper status message (FINISHED|PAUSED|OK)
>     
>     Yeah, this crossed my mind.  Perhaps we should poke at the API to see if 
> we can shake those requirements out?  For example, we could just return 
> OK/UPDATE_NOT_ACTIVE, and the caller should follow up with a query to 
> differentiate between a paused and finished update?
>     
>     I don't feel really strongly about this, but i think we should feel free 
> to question the requirements at this phase.
>     
>     > the transition to BLOCKED state happens during the evaluation loop 
> similar to other internally triggered state transitions 
> (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states 
> do not happen instantenously and are triggered by instance activities (i.e. 
> task state change events). I don't see why BLOCKED state should be any 
> different. The update is not technically alive until the evaluation event 
> arrives. Changing this behavior just for heartbeats is not worth the extra 
> complexity and will be inconsistent with the rest of the feature
>     
>     Transitions do indeed happen asynchronously when dealing with timeouts 
> (relevant code is in `InstanceActionHandler`).  These fall back to the same 
> loop, but give the updater a time after which the state should be 
> re-evaluated.
> 
> Maxim Khutornenko wrote:
>     >Yeah, this crossed my mind.  Perhaps we should poke at the API to see if 
> we can shake those requirements out?  For example, we could just return 
> OK/UPDATE_NOT_ACTIVE, and the caller should follow up with a query to 
> differentiate between a paused and finished update? I don't feel really 
> strongly about this, but i think we should feel free to question the 
> requirements at this phase.
>     
>     I don't think this would buy us much if external service starts following 
> up each pulse that returns UPDATE_NOT_ACTIVE. It would just makes the 
> contract harder to consume. If we are concerned about obuse I think having a 
> rate limiter would make more sense here. Also, I still need to pull 
> instructions to determine if the pulse is legit (i.e. the update is actually 
> a coordinated update). Otherwise, pulse goes nowhere and external service 
> continues to pulse the wrong update.
> 
> Bill Farner wrote:
>     > Also, I still need to pull instructions to determine if the pulse is 
> legit (i.e. the update is actually a coordinated update). Otherwise, pulse 
> goes nowhere and external service continues to pulse the wrong update.
>     
>     That's not _necessarily_ required, you could contain it in a `Runnable` 
> implementation that is responsible for the async transition.
> 
> Maxim Khutornenko wrote:
>     Not sure I get it. If it's in async part that means the pulse response is 
> already gone and the external service is oblivious what it actually pulsed.
> 
> David McLaughlin wrote:
>     For me the logic for pulse is fine, it's just the acquisition of the 
> write lock that is a little off. It should be this:
>     
>         pulse(JobKey key, String updateId) { 
>            // fetch the update status
>            // if the update is terminal 
>               // return GO_AWAY
>            // if the update is non-terminal
>               // update the pulse
>            
>            // if the update is currently blocked
>               // acquire write lock
>                  // change status
>         }
>         
>         
>     Give or take some operations. You could remove the fetching of the update 
> status, but then you're just asking the client to start polling the scheduler 
> every pulse instead to make the same decision.

Moved to async action to avoid delaying pulse response.


> On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote:
> > src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml,
> >  line 301
> > <https://reviews.apache.org/r/30225/diff/4/?file=848249#file848249line301>
> >
> >     is it possible to reduce the diff a bit by moving this block to its 
> > former location, or is it needed

This is diff playing tricks. I moved the filter code out to reuse it with 
details select.


- Maxim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30225/#review71063
-----------------------------------------------------------


On Feb. 4, 2015, 5:24 p.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30225/
> -----------------------------------------------------------
> 
> (Updated Feb. 4, 2015, 5:24 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner.
> 
> 
> Bugs: AURORA-1010
>     https://issues.apache.org/jira/browse/AURORA-1010
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Added pulsing support into the JobUpdateController. The qualified coordinated 
> updates get blocked until a pulse arrives. An update then becomes active and 
> proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a 
> terminal state (whichever comes first).
> 
> Not particularly happy with plumbing through OneWayJobUpdater but the 
> alternative is a state machine change, which is much hairier and will require 
> more changes in the JobUpdaterController. Going with the minimal diff here.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> 4f603f9e7ed004e53937a45ea8edf7241e15f5cf 
>   src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 
> 7f2ec71d744d5fafac84291cc79feba3398d0e1e 
>   src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java 
> b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 
>   src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java 
> d479d20259f284528b2291e2e486b36d8e47fe5e 
>   
> src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java
>  60f535988a20638fb16515d5027bfa65f1afb73c 
>   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java 
> d3b30d48b76d8d7c64cda006a34f7ed3296526f2 
>   
> src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
>  ec9d1e07abca1bf30262e1c0f741a34741e96f5d 
>   
> src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 
> 76460f95e058181b711fb6869f5a34c1d5bdfe31 
>   src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 
> 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 
>   
> src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml
>  f9c9ceddc559b43b4a5c45c745d54ff47484edde 
>   
> src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java
>  ca7c0c2675477cc727ca006697665f997972dfde 
>   
> src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java
>  89765ac3d34a827d3748fb96a275d78e9d1b8b72 
>   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
> 41d422939209d0808235128e4242c11e8ef25969 
>   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
> 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 
> 
> Diff: https://reviews.apache.org/r/30225/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>

Re: Review Request 30225: Modifying update controller to support heartbeats.

Reply via email to