Re: Review Request 30225: Modifying update controller to support heartbeats.

David McLaughlin Wed, 28 Jan 2015 19:36:20 -0800


> On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote:
> > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java,
> >  lines 259-263
> > <https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259>
> >
> >     I am unsure why this is being called inside pulse. Once pulse is 
> > activated, only the absence of a pulse can modify the update, right? We 
> > don't resume a paused update by receiving a pulse. 
> >     
> >     So surely the last pulse time would be checked externally to the method 
> > that performs the pulse? 
> >     
> >     If we can remove this, you can get rid of the write lock completely 
> > here, because all you need are strongly consistent reads (which we have) to 
> > accurately update the cooridinatedUpdateStates map correctly.
> 
> Maxim Khutornenko wrote:
>     An update blocked (not PAUSED) due to a missed pulse can be unblocked by 
> a new pulse. This covers a few important design desisions:
>     - An update can be created blocked by default awaiting for the first 
> pulse to start its progress;
>     - An occasional network partition/delay will not require an explicit 
> external service operation to resume;
>     - A scheduler restart is treated the same as initial update creation - an 
> update is rehydrated and waits for a pulse to resume;
>     
>     More details and scenarios here: 
> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> 
> David McLaughlin wrote:
>     How do we show to the user (via client output or UI) that the update is 
> currently blocked?
> 
> Maxim Khutornenko wrote:
>     One possible way is described here: 
> https://issues.apache.org/jira/browse/AURORA-1049
> 
> David McLaughlin wrote:
>     I don't think this is sufficient. We'd need auditing to explain to users 
> why an update was paused (blocked) for a given time, not just the current 
> status.
> 
> Maxim Khutornenko wrote:
>     That would require persistance and changing the actual status of the job. 
> I'd rather not introduce a new state that would only be applicable to 
> specific update configurations. The more important here is the visibility 
> into the internal "transient state" to troubleshoot a coordinated job unable 
> to make progress.
> 
> David McLaughlin wrote:
>     I don't follow the line of reasoning that 100% of updates have to use a 
> feature for us to have a state to reflect it. I think in terms of simplicity 
> and consistency, it makes way more sense to have an explicit 
> UPDATE_WAITING_FOR_PULSE state that these updates are initialised in and 
> moved to when blocked. A pulse could then have one valid state transition 
> that would require a write: UPDATE_WAITING_FOR_PULSE -> ROLLING_FORWARD. 
>     
>     The drawbacks of this compared to this approach I think would be 
> performance in the case of the external monitoring service flapping, or in 
> the case of a scheduler failover. To combat this, we could also add some kind 
> of initial grace period by setting a base timestamp at the point of leader 
> election. This timestamp could be used in the "should this be blocked?" 
> calculation when no previous pulse exists in the on-heap data structure.
> 
> Maxim Khutornenko wrote:
>     The consensus we reached on the dev list [1] is to implement the first 
> version of the feature without any persistence at all. We agreed to look into 
> reacher status reporting later as/if needed. 
>     
>     IMO, adding a new state is a more involved solution. We would need to 
> revisit the update state graph, figure out and validate the new rules for 
> user pause/resume actions wrt the new state and pulse. Not impossible but 
> hardly a blocking requirement either.
>     
>     [1] - 
> http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/browser


On that email thread, my last comment is to bring up this exact point :)

I'm -1 on proceeding without auditing for blocked state. But if the majority 
disagrees, then that's fine and I'm fine for this to proceed with another ship 
it.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30225/#review70058
-----------------------------------------------------------


On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30225/
> -----------------------------------------------------------
> 
> (Updated Jan. 23, 2015, 8:37 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner.
> 
> 
> Bugs: AURORA-1010
>     https://issues.apache.org/jira/browse/AURORA-1010
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Added pulsing support into the JobUpdateController. The qualified coordinated 
> updates get blocked until a pulse arrives. An update then becomes active and 
> proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a 
> terminal state (whichever comes first).
> 
> Not particularly happy with plumbing through OneWayJobUpdater but the 
> alternative is a state machine change, which is much hairier and will require 
> more changes in the JobUpdaterController. Going with the minimal diff here.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java 
> d3b30d48b76d8d7c64cda006a34f7ed3296526f2 
>   
> src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
>  a992938d4e12b20f81608be6bbdc24c0a211c3fd 
>   src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 
> 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 
>   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
> 4c827b183a87b4d97774edbfaa960bd1c3de72a5 
>   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
> 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 
> 
> Diff: https://reviews.apache.org/r/30225/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>

Re: Review Request 30225: Modifying update controller to support heartbeats.

Reply via email to