Re: Proposal: External Update Coordination

Maxim Khutornenko Wed, 15 Oct 2014 17:13:05 -0700

+1. Updated the doc:
https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md


On Wed, Oct 15, 2014 at 5:09 PM, Bill Farner <wfar...@apache.org> wrote:
> +1 to the scheduler not proceeding on an update when heartbeats are absent,
> and requiring the heartbeat service to explicitly call pauseJobUpdate when
> it detects problems.
>
> -=Bill
>
> On Wed, Oct 15, 2014 at 4:59 PM, Kevin Sweeney <kswee...@twitter.com.invalid
>> wrote:
>
>> Chatted with Maxim and Bill, I think we figured it out
>>
>> I think the confusion stems from the fact that there are two types of
>> pauses in this system, explicit, persisted pauses generated by the
>> pauseJobUpdate RPC and implicit, volatile pauses caused due to the absence
>> of a sufficiently fresh heartbeat (such as in the case of a network
>> partition).
>>
>> In case a monitoring service detects a problem it should call the explicit
>> pauseJobUpdate RPC, which will cause a state change that requires an
>> explicit resumeJobUpdate RPC to resume. That feature already exists.
>>
>> But, we need one more thing to make this reliable - heartbeats to protect
>> against network partitions between the scheduler and the monitoring
>> service. These can be volatile and lightweight - the scheduler just checks
>> for a sufficiently fresh heartbeat before it performs an update action, and
>> if none is present it simply refuses to perform the action. If the
>> partition heals a new heartbeat will arrive (if the update being monitored
>> should still be allowed to proceed) and the scheduler will allow the update
>> to proceed.
>>
>>
>> On Wed, Oct 15, 2014 at 11:56 AM, Bill Farner <wfar...@apache.org> wrote:
>>
>> > I think we should assess that after building the rest of the feature.
>> IIUC
>> > the rest of the code doesn't care if the update is initially paused.
>> >
>> > -=Bill
>> >
>> > On Wed, Oct 15, 2014 at 11:50 AM, Maxim Khutornenko <ma...@apache.org>
>> > wrote:
>> >
>> > > Can we get a consensus here? Looks like the only sticky point left is
>> > > around starting an update in paused vs. non-paused state. I can argue
>> > > either way as it's easy to add later if needed.
>> > >
>> > > On Tue, Oct 14, 2014 at 1:03 PM, Bill Farner <wfar...@apache.org>
>> wrote:
>> > > > I'm not arguing against the merits of the approach.  Just feeling out
>> > > > whether that should be done _after_ the rest of the heartbeat
>> support.
>> > > > Seems like it can be cleanly added at the end to get something usable
>> > > > earlier.
>> > > >
>> > > > -=Bill
>> > > >
>> > > > On Tue, Oct 14, 2014 at 12:38 PM, Kevin Sweeney <kevi...@apache.org>
>> > > wrote:
>> > > >
>> > > >> I'm +1 for using lack of heartbeats as a uniform
>> unknown-or-unhealthy
>> > > >> signal, and punting on a more complex NACK signal (which we'd have
>> to
>> > > >> reliably persist).
>> > > >>
>> > > >> I think the only disagreement in this thread is whether the default
>> > > state
>> > > >> for a new update should be running or waiting-for-heartbeat. I think
>> > > >> waiting for a heartbeat is not only a more correct implementation
>> (no
>> > > risk
>> > > >> of acting after a failover but before the heartbeat timeout) but
>> > > simpler to
>> > > >> implement (initialize the PulseMonitor data structure as empty
>> rather
>> > > than
>> > > >> with a synthetic heartbeat).
>> > > >>
>> > > >> From an API consumer perspective the sequence is:
>> > > >>
>> > > >> 1. API client sends a startUpdate RPC to the scheduler
>> > > >> 2. API client receives an OK response, then arranges for something
>> to
>> > > call
>> > > >> heartbeat with that updateId on some interval
>> > > >> 3. Whatever is supposed to send heartbeats sends one immediately,
>> then
>> > > >> starts sending them on some smaller interval
>> > > >>
>> > > >> Waiting for the first heartbeat ensures that this sequence has been
>> > > >> completed successfully, while not waiting for it only ensure that
>> > step 1
>> > > >> has happened.
>> > > >>
>> > > >>
>> > > >> On Tue, Oct 14, 2014 at 12:18 PM, Bill Farner <wfar...@apache.org>
>> > > wrote:
>> > > >>
>> > > >> > Wait - simpler solution than what?  We're talking about not doing
>> > > either.
>> > > >> >
>> > > >> > -=Bill
>> > > >> >
>> > > >> > On Tue, Oct 14, 2014 at 12:16 PM, Kevin Sweeney <
>> kevi...@apache.org
>> > >
>> > > >> > wrote:
>> > > >> >
>> > > >> > > I think waiting for the first heartbeat before taking any action
>> > is
>> > > the
>> > > >> > > simpler solution here as it allows the implementation to be
>> > entirely
>> > > >> > > soft-state and still catches the bugs I described.
>> > > >> > >
>> > > >> > > The implementation is just PulseMonitorImpl<UpdateId> -
>> heartbeat
>> > > calls
>> > > >> > > pulse and mutation operations check isAlive. I think the code
>> > might
>> > > >> > > actually work as-is.
>> > > >> > >
>> > > >> > > On Tue, Oct 14, 2014 at 12:11 PM, Maxim Khutornenko <
>> > > ma...@apache.org>
>> > > >> > > wrote:
>> > > >> > >
>> > > >> > > > Pausing update on creation seems like a logical approach when
>> > > dealing
>> > > >> > > > with inverted dependency model. I.e. updater is happy to act
>> as
>> > > long
>> > > >> > > > as it's greenlighted by the external signal. It's also aligned
>> > > with a
>> > > >> > > > failover experience where coordinated updates are rehydrated
>> in
>> > > >> paused
>> > > >> > > > state waiting for HB awakening. That said, I am OK punting it
>> > for
>> > > the
>> > > >> > > > sake of simplicity for now.
>> > > >> > > >
>> > > >> > > > Kevin?
>> > > >> > > >
>> > > >> > > > On Tue, Oct 14, 2014 at 12:05 PM, Bill Farner <
>> > wfar...@apache.org
>> > > >
>> > > >> > > wrote:
>> > > >> > > > > If the goal is to reduce complexity now and add features
>> > later,
>> > > why
>> > > >> > not
>> > > >> > > > > nuke both for now - kick off the update right away, and let
>> > > lack of
>> > > >> > > > > heartbeats serve as a uniform "unknown or unhealthy" signal?
>> > > >> > > > >
>> > > >> > > > > -=Bill
>> > > >> > > > >
>> > > >> > > > > On Mon, Oct 13, 2014 at 5:25 PM, Maxim Khutornenko <
>> > > >> ma...@apache.org
>> > > >> > >
>> > > >> > > > wrote:
>> > > >> > > > >
>> > > >> > > > >> I am still +1 on the idea to have default paused state on
>> > > >> creation.
>> > > >> > I
>> > > >> > > > >> think we could still differentiate between initially paused
>> > and
>> > > >> > timed
>> > > >> > > > >> out states internally by looking at pause reason. It's
>> quite
>> > > >> > different
>> > > >> > > > >> if we want to store explicit NACK reasons from the external
>> > > >> service
>> > > >> > > > >> though. That would require persistence and a bit more
>> > > complicated
>> > > >> > > > >> logic.
>> > > >> > > > >>
>> > > >> > > > >> On Mon, Oct 13, 2014 at 5:15 PM, Kevin Sweeney <
>> > > >> kevi...@apache.org>
>> > > >> > > > wrote:
>> > > >> > > > >> > I like the idea of implementing this scheduler-side
>> purely
>> > > >> through
>> > > >> > > > >> volatile
>> > > >> > > > >> > state, but the lack of feedback (generic vs specific
>> error
>> > > >> > messages
>> > > >> > > > when
>> > > >> > > > >> an
>> > > >> > > > >> > update is paused) leaves something to be desired. Maybe
>> we
>> > > can
>> > > >> > > address
>> > > >> > > > >> that
>> > > >> > > > >> > with a metadata field in the initial call to startUpdate
>> > > (with
>> > > >> an
>> > > >> > > > >> optional
>> > > >> > > > >> > link to a page where one can get more rich information
>> > about
>> > > the
>> > > >> > > > state of
>> > > >> > > > >> > the monitor sending/not sending heartbeats).
>> > > >> > > > >> >
>> > > >> > > > >> > The main drawback is that we may have to wait a maximum
>> of
>> > > one
>> > > >> > > > heartbeat
>> > > >> > > > >> > interval to find out that an update should be paused.
>> > > >> > > > >> >
>> > > >> > > > >> > On Mon, Oct 13, 2014 at 4:55 PM, Maxim Khutornenko <
>> > > >> > > ma...@apache.org>
>> > > >> > > > >> wrote:
>> > > >> > > > >> >
>> > > >> > > > >> >> The main reason I preferred the lack-of-ACK approach
>> over
>> > an
>> > > >> > > explicit
>> > > >> > > > >> >> NACK one is simplicity. As Joshua pointed out there is
>> > more
>> > > >> state
>> > > >> > > to
>> > > >> > > > >> >> handle in that case. The lack-of-ACK model can be
>> > completely
>> > > >> > > > >> >> implemented in volatile memory sidestepping the
>> persistent
>> > > >> > storage
>> > > >> > > > >> >> entirely. With the NACK we would need to reliably
>> persist
>> > > >> > external
>> > > >> > > > >> >> service call reasons to survive scheduler failovers.
>> Not a
>> > > huge
>> > > >> > > > >> >> challenge but something to keep in mind.
>> > > >> > > > >> >>
>> > > >> > > > >> >> I still think the simplicity/reliability tradeoff is
>> > > acceptable
>> > > >> > > here
>> > > >> > > > >> >> if we rely on external service to abort heartbeats in
>> case
>> > > of a
>> > > >> > > > health
>> > > >> > > > >> >> alert fired. This can be explicitly documented as an
>> > > external
>> > > >> > > > >> >> integration requirement. However, If the consensus is to
>> > go
>> > > a
>> > > >> > more
>> > > >> > > > >> >> reliable (though more complicated) NACK route I am happy
>> > to
>> > > >> > > > reconsider
>> > > >> > > > >> >> the current proposal.
>> > > >> > > > >> >>
>> > > >> > > > >> >> On Mon, Oct 13, 2014 at 3:50 PM, Joshua Cohen <
>> > > >> > > > jco...@twopensource.com>
>> > > >> > > > >> >> wrote:
>> > > >> > > > >> >> > "The heratbeatJobUpdate RPC serves as an ACK, but we
>> > don't
>> > > >> > have a
>> > > >> > > > >> NACK.
>> > > >> > > > >> >> If
>> > > >> > > > >> >> > we are going to let lack-of-ACK serve as the NACK, i
>> > don't
>> > > >> > think
>> > > >> > > > it's
>> > > >> > > > >> >> safe
>> > > >> > > > >> >> > to resume when we receive another ACK.  In other
>> words,
>> > a
>> > > >> > service
>> > > >> > > > >> >> toggling
>> > > >> > > > >> >> > unhealthy might not be deemed safe to proceed."
>> > > >> > > > >> >> >
>> > > >> > > > >> >> > Lack-of-ACK is the scenario where connectivity between
>> > the
>> > > >> > > monitor
>> > > >> > > > and
>> > > >> > > > >> >> the
>> > > >> > > > >> >> > scheduler is unavailable. Shouldn't the NACK scenario
>> > > >> > (everything
>> > > >> > > > is
>> > > >> > > > >> not
>> > > >> > > > >> >> > ok!) be handled by the monitoring service triggering
>> an
>> > > >> > explicit
>> > > >> > > > >> pause?
>> > > >> > > > >> >> > I.e. section 2 should be updated to say "External
>> > service
>> > > >> > detects
>> > > >> > > > >> service
>> > > >> > > > >> >> > health problems and pauses the update" and section 4
>> > > becomes
>> > > >> > the
>> > > >> > > > >> current
>> > > >> > > > >> >> > section 2 (i.e. "Should a heartbeat not be received
>> the
>> > > >> > scheduler
>> > > >> > > > >> pauses
>> > > >> > > > >> >> > the update.").
>> > > >> > > > >> >> >
>> > > >> > > > >> >> > I agree that it's unsafe to to resume updates after
>> > > >> receiving a
>> > > >> > > > >> heartbeat
>> > > >> > > > >> >> > after previously pausing due to a missed heartbeat. In
>> > > that
>> > > >> > > > scenario
>> > > >> > > > >> I'd
>> > > >> > > > >> >> > think we'd want an explicit resumeJobUpdate. If the
>> > > scenario
>> > > >> > > we're
>> > > >> > > > >> trying
>> > > >> > > > >> >> > to handle is *never* received a heartbeat, that's a
>> > > separate
>> > > >> > > > matter,
>> > > >> > > > >> in
>> > > >> > > > >> >> > that case unpausing upon receiving the first heartbeat
>> > > would
>> > > >> > make
>> > > >> > > > >> sense,
>> > > >> > > > >> >> > but it feels like that complicates things quite a bit
>> > > (now we
>> > > >> > > need
>> > > >> > > > to
>> > > >> > > > >> >> > differentiate between heartbeat #1 and hearbeat #N).
>> > > >> > > > >> >> >
>> > > >> > > > >> >> > On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner <
>> > > >> > wfar...@apache.org
>> > > >> > > >
>> > > >> > > > >> wrote:
>> > > >> > > > >> >> >
>> > > >> > > > >> >> >> What is the guidance for deploying while the
>> heartbeat
>> > > >> service
>> > > >> > > is
>> > > >> > > > >> >> broken?
>> > > >> > > > >> >> >> I think i know the answer, but it's important to
>> spell
>> > > out.
>> > > >> > > > >> >> >>
>> > > >> > > > >> >> >>
>> > > >> > > > >> >> >>
>> > > >> > > > >> >> >> > Create a new coordinated job update in a paused
>> > > >> > > > >> (ROLL_FORWARD_PAUSED)
>> > > >> > > > >> >> >> > state to avoid any progress until the first
>> heartbeat
>> > > call
>> > > >> > > > arrives.
>> > > >> > > > >> >> >>
>> > > >> > > > >> >> >>
>> > > >> > > > >> >> >> I'm not sold on this being ultimately beneficial.  In
>> > the
>> > > >> > worst
>> > > >> > > > case,
>> > > >> > > > >> >> >> impact is still limited by the health check
>> threshold.
>> > > >> Seems
>> > > >> > > like
>> > > >> > > > >> >> >> premature optimization at best, and an odd one if we
>> > > proceed
>> > > >> > > > without
>> > > >> > > > >> a
>> > > >> > > > >> >> >> 'NACK' signal via the heartbeatJobUpdate RPC.
>> > > >> > > > >> >> >>
>> > > >> > > > >> >> >> Allow resuming of the paused-due-to-no-heartbeat
>> update
>> > > via
>> > > >> a
>> > > >> > > > >> >> >> > resumeJobUpdate call.
>> > > >> > > > >> >> >>
>> > > >> > > > >> >> >>
>> > > >> > > > >> >> >> Are heartbeats required while rolling back?  If so,
>> > that
>> > > >> might
>> > > >> > > > impact
>> > > >> > > > >> >> the
>> > > >> > > > >> >> >> design here and in other places.
>> > > >> > > > >> >> >>
>> > > >> > > > >> >> >> Allow resuming of the paused-due-to-no-heartbeat
>> update
>> > > via
>> > > >> a
>> > > >> > > > fresh
>> > > >> > > > >> >> >> > heartbeatJobUpdate call.
>> > > >> > > > >> >> >>
>> > > >> > > > >> >> >>
>> > > >> > > > >> >> >> The heratbeatJobUpdate RPC serves as an ACK, but we
>> > don't
>> > > >> > have a
>> > > >> > > > >> NACK.
>> > > >> > > > >> >> If
>> > > >> > > > >> >> >> we are going to let lack-of-ACK serve as the NACK, i
>> > > don't
>> > > >> > think
>> > > >> > > > it's
>> > > >> > > > >> >> safe
>> > > >> > > > >> >> >> to resume when we receive another ACK.  In other
>> > words, a
>> > > >> > > service
>> > > >> > > > >> >> toggling
>> > > >> > > > >> >> >> unhealthy might not be deemed safe to proceed.
>> > > >> > > > >> >> >>
>> > > >> > > > >> >> >> Perhaps just sending OK (or a NOOP equivalent) in
>> case
>> > > of a
>> > > >> > > > >> user-paused
>> > > >> > > > >> >> job
>> > > >> > > > >> >> >> > update would make more sense as there is nothing
>> > > >> monitoring
>> > > >> > > > service
>> > > >> > > > >> >> could
>> > > >> > > > >> >> >> > do in that case. This should work fine with
>> > > pause/resume
>> > > >> > > > >> >> -aware/-agnostic
>> > > >> > > > >> >> >> > monitoring service implementation.
>> > > >> > > > >> >> >>
>> > > >> > > > >> >> >>
>> > > >> > > > >> >> >> This seems reasonable to me - heartbeats for a paused
>> > > update
>> > > >> > > > should
>> > > >> > > > >> not
>> > > >> > > > >> >> >> pose a risk, but can be safely ignored.
>> > > >> > > > >> >> >>
>> > > >> > > > >> >> >>
>> > > >> > > > >> >> >>
>> > > >> > > > >> >> >> -=Bill
>> > > >> > > > >> >> >>
>> > > >> > > > >> >> >> On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko <
>> > > >> > > > >> ma...@apache.org>
>> > > >> > > > >> >> >> wrote:
>> > > >> > > > >> >> >>
>> > > >> > > > >> >> >> > Agreed. That would be a logical generalization of
>> the
>> > > post
>> > > >> > > > failover
>> > > >> > > > >> >> >> > behavior.
>> > > >> > > > >> >> >> >
>> > > >> > > > >> >> >> > I have updated the above document with the
>> following
>> > > >> > changes:
>> > > >> > > > >> >> >> > - Reply with PAUSED any time a job was paused by
>> > user;
>> > > >> > > > >> >> >> > - Start in paused state by default.
>> > > >> > > > >> >> >> >
>> > > >> > > > >> >> >> > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney <
>> > > >> > > > >> kevi...@apache.org>
>> > > >> > > > >> >> >> > wrote:
>> > > >> > > > >> >> >> > > The doc mentioned that the scheduler will start
>> an
>> > > >> update
>> > > >> > > > >> subject to
>> > > >> > > > >> >> >> the
>> > > >> > > > >> >> >> > > heartbeat countdown, and if it doesn't receive a
>> > > >> heartbeat
>> > > >> > > it
>> > > >> > > > >> will
>> > > >> > > > >> >> >> pause
>> > > >> > > > >> >> >> > > the update. Why not start with the update
>> > > >> > > > >> >> paused-due-to-no-heartbeat to
>> > > >> > > > >> >> >> > > fail-fast any connectivity issues between the
>> > service
>> > > >> > > > providing
>> > > >> > > > >> the
>> > > >> > > > >> >> >> > > heartbeats and the scheduler?
>> > > >> > > > >> >> >> > >
>> > > >> > > > >> >> >> > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim
>> > Khutornenko <
>> > > >> > > > >> >> ma...@apache.org>
>> > > >> > > > >> >> >> > > wrote:
>> > > >> > > > >> >> >> > >
>> > > >> > > > >> >> >> > >> Hi all,
>> > > >> > > > >> >> >> > >>
>> > > >> > > > >> >> >> > >> We are proposing a new feature for the scheduler
>> > > >> updater,
>> > > >> > > > which
>> > > >> > > > >> you
>> > > >> > > > >> >> >> > >> may find helpful.
>> > > >> > > > >> >> >> > >>
>> > > >> > > > >> >> >> > >> I have posed a brief feature summary here:
>> > > >> > > > >> >> >> > >>
>> > > >> > > > >> >> >> > >>
>> > > >> > > > >> >> >> >
>> > > >> > > > >> >> >>
>> > > >> > > > >> >>
>> > > >> > > > >>
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >>
>> > >
>> >
>> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
>> > > >> > > > >> >> >> > >>
>> > > >> > > > >> >> >> > >> Please, reply with your
>> > feedback/concerns/comments.
>> > > >> > > > >> >> >> > >>
>> > > >> > > > >> >> >> > >> Thanks,
>> > > >> > > > >> >> >> > >> Maxim
>> > > >> > > > >> >> >> > >>
>> > > >> > > > >> >> >> >
>> > > >> > > > >> >> >>
>> > > >> > > > >> >>
>> > > >> > > > >>
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >>
>> > >
>> >
>>
>>
>>
>> --
>> Kevin Sweeney
>> @kts
>>

Re: Proposal: External Update Coordination

Reply via email to