Re: Review Request 30225: Modifying update controller to support heartbeats.

David McLaughlin Wed, 04 Feb 2015 17:14:49 -0800


> On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java,
> >  line 273
> > <https://reviews.apache.org/r/30225/diff/4/?file=848246#file848246line273>
> >
> >     I think avoid acquisition of a write lock here is a good goal to aim 
> > for.  If, for example, we are already holding the write lock for a large 
> > snapshot, we could cause it to appear as though we have not received a 
> > pulse for an extended duration.
> >     
> >     Also, given that there is likely to be an automated system on the other 
> > end, we should be somewhat defensive to a high call frequency.  This 
> > suggests to me that potentially-expensive read operations should be avoided 
> > if possible.
> >     
> >     For these reasons, i think it would be wise to decouple recording of 
> > pulses from reacting to them.  It's also apparent that there is currently 
> > no asynchronous transition to the 'blocked' states, they're triggered 
> > either by a tardy pulse or another external action.  I think this would be 
> > surprising behavior, and the scheduler should automatically transition to 
> > these states without any external input.
> 
> Maxim Khutornenko wrote:
>     "I think avoid acquisition of a write lock here is a good goal to aim 
> for.  If, for example, we are already holding the write lock for a large 
> snapshot, we could cause it to appear as though we have not received a pulse 
> for an extended duration." - makes sense, I can try to split the recording 
> part from the status update via an async action. 
>     
>     "Also, given that there is likely to be an automated system on the other 
> end, we should be somewhat defensive to a high call frequency.  This suggests 
> to me that potentially-expensive read operations should be avoided if 
> possible." - I don't see how this would be possible without breaking the RPC 
> contract as I still have to pull JobUpdateDetails in order to respond to the 
> pulse with proper status message (FINISHED|PAUSED|OK).
>     
>     
>     "It's also apparent that there is currently no asynchronous transition to 
> the 'blocked' states, they're triggered either by a tardy pulse or another 
> external action.  I think this would be surprising behavior, and the 
> scheduler should automatically transition to these states without any 
> external input." - the transition to BLOCKED state happens during the 
> evaluation loop similar to other internally triggered state transitions 
> (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states 
> do not happen instantenously and are triggered by instance activities (i.e. 
> task state change events). I don't see why BLOCKED state should be any 
> different. The update is not technically alive until the evaluation event 
> arrives. Changing this behavior just for heartbeats is not worth the extra 
> complexity and will be inconsistent with the rest of the feature.
> 
> Bill Farner wrote:
>     > I don't see how this would be possible without breaking the RPC 
> contract as I still have to pull JobUpdateDetails in order to respond to the 
> pulse with proper status message (FINISHED|PAUSED|OK)
>     
>     Yeah, this crossed my mind.  Perhaps we should poke at the API to see if 
> we can shake those requirements out?  For example, we could just return 
> OK/UPDATE_NOT_ACTIVE, and the caller should follow up with a query to 
> differentiate between a paused and finished update?
>     
>     I don't feel really strongly about this, but i think we should feel free 
> to question the requirements at this phase.
>     
>     > the transition to BLOCKED state happens during the evaluation loop 
> similar to other internally triggered state transitions 
> (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states 
> do not happen instantenously and are triggered by instance activities (i.e. 
> task state change events). I don't see why BLOCKED state should be any 
> different. The update is not technically alive until the evaluation event 
> arrives. Changing this behavior just for heartbeats is not worth the extra 
> complexity and will be inconsistent with the rest of the feature
>     
>     Transitions do indeed happen asynchronously when dealing with timeouts 
> (relevant code is in `InstanceActionHandler`).  These fall back to the same 
> loop, but give the updater a time after which the state should be 
> re-evaluated.
> 
> Maxim Khutornenko wrote:
>     >Yeah, this crossed my mind.  Perhaps we should poke at the API to see if 
> we can shake those requirements out?  For example, we could just return 
> OK/UPDATE_NOT_ACTIVE, and the caller should follow up with a query to 
> differentiate between a paused and finished update? I don't feel really 
> strongly about this, but i think we should feel free to question the 
> requirements at this phase.
>     
>     I don't think this would buy us much if external service starts following 
> up each pulse that returns UPDATE_NOT_ACTIVE. It would just makes the 
> contract harder to consume. If we are concerned about obuse I think having a 
> rate limiter would make more sense here. Also, I still need to pull 
> instructions to determine if the pulse is legit (i.e. the update is actually 
> a coordinated update). Otherwise, pulse goes nowhere and external service 
> continues to pulse the wrong update.
> 
> Bill Farner wrote:
>     > Also, I still need to pull instructions to determine if the pulse is 
> legit (i.e. the update is actually a coordinated update). Otherwise, pulse 
> goes nowhere and external service continues to pulse the wrong update.
>     
>     That's not _necessarily_ required, you could contain it in a `Runnable` 
> implementation that is responsible for the async transition.
> 
> Maxim Khutornenko wrote:
>     Not sure I get it. If it's in async part that means the pulse response is 
> already gone and the external service is oblivious what it actually pulsed.


For me the logic for pulse is fine, it's just the acquisition of the write lock 
that is a little off. It should be this:

    pulse(JobKey key, String updateId) { 
       // fetch the update status
       // if the update is terminal 
          // return GO_AWAY
       // if the update is non-terminal
          // update the pulse
       
       // if the update is currently blocked
          // acquire write lock
             // change status
    }
    
    
Give or take some operations. You could remove the fetching of the update 
status, but then you're just asking the client to start polling the scheduler 
every pulse instead to make the same decision.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30225/#review71063
-----------------------------------------------------------


On Feb. 4, 2015, 5:24 p.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30225/
> -----------------------------------------------------------
> 
> (Updated Feb. 4, 2015, 5:24 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner.
> 
> 
> Bugs: AURORA-1010
>     https://issues.apache.org/jira/browse/AURORA-1010
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Added pulsing support into the JobUpdateController. The qualified coordinated 
> updates get blocked until a pulse arrives. An update then becomes active and 
> proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a 
> terminal state (whichever comes first).
> 
> Not particularly happy with plumbing through OneWayJobUpdater but the 
> alternative is a state machine change, which is much hairier and will require 
> more changes in the JobUpdaterController. Going with the minimal diff here.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> 4f603f9e7ed004e53937a45ea8edf7241e15f5cf 
>   src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 
> 7f2ec71d744d5fafac84291cc79feba3398d0e1e 
>   src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java 
> b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 
>   src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java 
> d479d20259f284528b2291e2e486b36d8e47fe5e 
>   
> src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java
>  60f535988a20638fb16515d5027bfa65f1afb73c 
>   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java 
> d3b30d48b76d8d7c64cda006a34f7ed3296526f2 
>   
> src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
>  ec9d1e07abca1bf30262e1c0f741a34741e96f5d 
>   
> src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 
> 76460f95e058181b711fb6869f5a34c1d5bdfe31 
>   src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 
> 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 
>   
> src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml
>  f9c9ceddc559b43b4a5c45c745d54ff47484edde 
>   
> src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java
>  ca7c0c2675477cc727ca006697665f997972dfde 
>   
> src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java
>  89765ac3d34a827d3748fb96a275d78e9d1b8b72 
>   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
> 41d422939209d0808235128e4242c11e8ef25969 
>   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
> 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 
> 
> Diff: https://reviews.apache.org/r/30225/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>

Re: Review Request 30225: Modifying update controller to support heartbeats.

Reply via email to