+1. Updated the doc: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
On Wed, Oct 15, 2014 at 5:09 PM, Bill Farner <wfar...@apache.org> wrote: > +1 to the scheduler not proceeding on an update when heartbeats are absent, > and requiring the heartbeat service to explicitly call pauseJobUpdate when > it detects problems. > > -=Bill > > On Wed, Oct 15, 2014 at 4:59 PM, Kevin Sweeney <kswee...@twitter.com.invalid >> wrote: > >> Chatted with Maxim and Bill, I think we figured it out >> >> I think the confusion stems from the fact that there are two types of >> pauses in this system, explicit, persisted pauses generated by the >> pauseJobUpdate RPC and implicit, volatile pauses caused due to the absence >> of a sufficiently fresh heartbeat (such as in the case of a network >> partition). >> >> In case a monitoring service detects a problem it should call the explicit >> pauseJobUpdate RPC, which will cause a state change that requires an >> explicit resumeJobUpdate RPC to resume. That feature already exists. >> >> But, we need one more thing to make this reliable - heartbeats to protect >> against network partitions between the scheduler and the monitoring >> service. These can be volatile and lightweight - the scheduler just checks >> for a sufficiently fresh heartbeat before it performs an update action, and >> if none is present it simply refuses to perform the action. If the >> partition heals a new heartbeat will arrive (if the update being monitored >> should still be allowed to proceed) and the scheduler will allow the update >> to proceed. >> >> >> On Wed, Oct 15, 2014 at 11:56 AM, Bill Farner <wfar...@apache.org> wrote: >> >> > I think we should assess that after building the rest of the feature. >> IIUC >> > the rest of the code doesn't care if the update is initially paused. >> > >> > -=Bill >> > >> > On Wed, Oct 15, 2014 at 11:50 AM, Maxim Khutornenko <ma...@apache.org> >> > wrote: >> > >> > > Can we get a consensus here? Looks like the only sticky point left is >> > > around starting an update in paused vs. non-paused state. I can argue >> > > either way as it's easy to add later if needed. >> > > >> > > On Tue, Oct 14, 2014 at 1:03 PM, Bill Farner <wfar...@apache.org> >> wrote: >> > > > I'm not arguing against the merits of the approach. Just feeling out >> > > > whether that should be done _after_ the rest of the heartbeat >> support. >> > > > Seems like it can be cleanly added at the end to get something usable >> > > > earlier. >> > > > >> > > > -=Bill >> > > > >> > > > On Tue, Oct 14, 2014 at 12:38 PM, Kevin Sweeney <kevi...@apache.org> >> > > wrote: >> > > > >> > > >> I'm +1 for using lack of heartbeats as a uniform >> unknown-or-unhealthy >> > > >> signal, and punting on a more complex NACK signal (which we'd have >> to >> > > >> reliably persist). >> > > >> >> > > >> I think the only disagreement in this thread is whether the default >> > > state >> > > >> for a new update should be running or waiting-for-heartbeat. I think >> > > >> waiting for a heartbeat is not only a more correct implementation >> (no >> > > risk >> > > >> of acting after a failover but before the heartbeat timeout) but >> > > simpler to >> > > >> implement (initialize the PulseMonitor data structure as empty >> rather >> > > than >> > > >> with a synthetic heartbeat). >> > > >> >> > > >> From an API consumer perspective the sequence is: >> > > >> >> > > >> 1. API client sends a startUpdate RPC to the scheduler >> > > >> 2. API client receives an OK response, then arranges for something >> to >> > > call >> > > >> heartbeat with that updateId on some interval >> > > >> 3. Whatever is supposed to send heartbeats sends one immediately, >> then >> > > >> starts sending them on some smaller interval >> > > >> >> > > >> Waiting for the first heartbeat ensures that this sequence has been >> > > >> completed successfully, while not waiting for it only ensure that >> > step 1 >> > > >> has happened. >> > > >> >> > > >> >> > > >> On Tue, Oct 14, 2014 at 12:18 PM, Bill Farner <wfar...@apache.org> >> > > wrote: >> > > >> >> > > >> > Wait - simpler solution than what? We're talking about not doing >> > > either. >> > > >> > >> > > >> > -=Bill >> > > >> > >> > > >> > On Tue, Oct 14, 2014 at 12:16 PM, Kevin Sweeney < >> kevi...@apache.org >> > > >> > > >> > wrote: >> > > >> > >> > > >> > > I think waiting for the first heartbeat before taking any action >> > is >> > > the >> > > >> > > simpler solution here as it allows the implementation to be >> > entirely >> > > >> > > soft-state and still catches the bugs I described. >> > > >> > > >> > > >> > > The implementation is just PulseMonitorImpl<UpdateId> - >> heartbeat >> > > calls >> > > >> > > pulse and mutation operations check isAlive. I think the code >> > might >> > > >> > > actually work as-is. >> > > >> > > >> > > >> > > On Tue, Oct 14, 2014 at 12:11 PM, Maxim Khutornenko < >> > > ma...@apache.org> >> > > >> > > wrote: >> > > >> > > >> > > >> > > > Pausing update on creation seems like a logical approach when >> > > dealing >> > > >> > > > with inverted dependency model. I.e. updater is happy to act >> as >> > > long >> > > >> > > > as it's greenlighted by the external signal. It's also aligned >> > > with a >> > > >> > > > failover experience where coordinated updates are rehydrated >> in >> > > >> paused >> > > >> > > > state waiting for HB awakening. That said, I am OK punting it >> > for >> > > the >> > > >> > > > sake of simplicity for now. >> > > >> > > > >> > > >> > > > Kevin? >> > > >> > > > >> > > >> > > > On Tue, Oct 14, 2014 at 12:05 PM, Bill Farner < >> > wfar...@apache.org >> > > > >> > > >> > > wrote: >> > > >> > > > > If the goal is to reduce complexity now and add features >> > later, >> > > why >> > > >> > not >> > > >> > > > > nuke both for now - kick off the update right away, and let >> > > lack of >> > > >> > > > > heartbeats serve as a uniform "unknown or unhealthy" signal? >> > > >> > > > > >> > > >> > > > > -=Bill >> > > >> > > > > >> > > >> > > > > On Mon, Oct 13, 2014 at 5:25 PM, Maxim Khutornenko < >> > > >> ma...@apache.org >> > > >> > > >> > > >> > > > wrote: >> > > >> > > > > >> > > >> > > > >> I am still +1 on the idea to have default paused state on >> > > >> creation. >> > > >> > I >> > > >> > > > >> think we could still differentiate between initially paused >> > and >> > > >> > timed >> > > >> > > > >> out states internally by looking at pause reason. It's >> quite >> > > >> > different >> > > >> > > > >> if we want to store explicit NACK reasons from the external >> > > >> service >> > > >> > > > >> though. That would require persistence and a bit more >> > > complicated >> > > >> > > > >> logic. >> > > >> > > > >> >> > > >> > > > >> On Mon, Oct 13, 2014 at 5:15 PM, Kevin Sweeney < >> > > >> kevi...@apache.org> >> > > >> > > > wrote: >> > > >> > > > >> > I like the idea of implementing this scheduler-side >> purely >> > > >> through >> > > >> > > > >> volatile >> > > >> > > > >> > state, but the lack of feedback (generic vs specific >> error >> > > >> > messages >> > > >> > > > when >> > > >> > > > >> an >> > > >> > > > >> > update is paused) leaves something to be desired. Maybe >> we >> > > can >> > > >> > > address >> > > >> > > > >> that >> > > >> > > > >> > with a metadata field in the initial call to startUpdate >> > > (with >> > > >> an >> > > >> > > > >> optional >> > > >> > > > >> > link to a page where one can get more rich information >> > about >> > > the >> > > >> > > > state of >> > > >> > > > >> > the monitor sending/not sending heartbeats). >> > > >> > > > >> > >> > > >> > > > >> > The main drawback is that we may have to wait a maximum >> of >> > > one >> > > >> > > > heartbeat >> > > >> > > > >> > interval to find out that an update should be paused. >> > > >> > > > >> > >> > > >> > > > >> > On Mon, Oct 13, 2014 at 4:55 PM, Maxim Khutornenko < >> > > >> > > ma...@apache.org> >> > > >> > > > >> wrote: >> > > >> > > > >> > >> > > >> > > > >> >> The main reason I preferred the lack-of-ACK approach >> over >> > an >> > > >> > > explicit >> > > >> > > > >> >> NACK one is simplicity. As Joshua pointed out there is >> > more >> > > >> state >> > > >> > > to >> > > >> > > > >> >> handle in that case. The lack-of-ACK model can be >> > completely >> > > >> > > > >> >> implemented in volatile memory sidestepping the >> persistent >> > > >> > storage >> > > >> > > > >> >> entirely. With the NACK we would need to reliably >> persist >> > > >> > external >> > > >> > > > >> >> service call reasons to survive scheduler failovers. >> Not a >> > > huge >> > > >> > > > >> >> challenge but something to keep in mind. >> > > >> > > > >> >> >> > > >> > > > >> >> I still think the simplicity/reliability tradeoff is >> > > acceptable >> > > >> > > here >> > > >> > > > >> >> if we rely on external service to abort heartbeats in >> case >> > > of a >> > > >> > > > health >> > > >> > > > >> >> alert fired. This can be explicitly documented as an >> > > external >> > > >> > > > >> >> integration requirement. However, If the consensus is to >> > go >> > > a >> > > >> > more >> > > >> > > > >> >> reliable (though more complicated) NACK route I am happy >> > to >> > > >> > > > reconsider >> > > >> > > > >> >> the current proposal. >> > > >> > > > >> >> >> > > >> > > > >> >> On Mon, Oct 13, 2014 at 3:50 PM, Joshua Cohen < >> > > >> > > > jco...@twopensource.com> >> > > >> > > > >> >> wrote: >> > > >> > > > >> >> > "The heratbeatJobUpdate RPC serves as an ACK, but we >> > don't >> > > >> > have a >> > > >> > > > >> NACK. >> > > >> > > > >> >> If >> > > >> > > > >> >> > we are going to let lack-of-ACK serve as the NACK, i >> > don't >> > > >> > think >> > > >> > > > it's >> > > >> > > > >> >> safe >> > > >> > > > >> >> > to resume when we receive another ACK. In other >> words, >> > a >> > > >> > service >> > > >> > > > >> >> toggling >> > > >> > > > >> >> > unhealthy might not be deemed safe to proceed." >> > > >> > > > >> >> > >> > > >> > > > >> >> > Lack-of-ACK is the scenario where connectivity between >> > the >> > > >> > > monitor >> > > >> > > > and >> > > >> > > > >> >> the >> > > >> > > > >> >> > scheduler is unavailable. Shouldn't the NACK scenario >> > > >> > (everything >> > > >> > > > is >> > > >> > > > >> not >> > > >> > > > >> >> > ok!) be handled by the monitoring service triggering >> an >> > > >> > explicit >> > > >> > > > >> pause? >> > > >> > > > >> >> > I.e. section 2 should be updated to say "External >> > service >> > > >> > detects >> > > >> > > > >> service >> > > >> > > > >> >> > health problems and pauses the update" and section 4 >> > > becomes >> > > >> > the >> > > >> > > > >> current >> > > >> > > > >> >> > section 2 (i.e. "Should a heartbeat not be received >> the >> > > >> > scheduler >> > > >> > > > >> pauses >> > > >> > > > >> >> > the update."). >> > > >> > > > >> >> > >> > > >> > > > >> >> > I agree that it's unsafe to to resume updates after >> > > >> receiving a >> > > >> > > > >> heartbeat >> > > >> > > > >> >> > after previously pausing due to a missed heartbeat. In >> > > that >> > > >> > > > scenario >> > > >> > > > >> I'd >> > > >> > > > >> >> > think we'd want an explicit resumeJobUpdate. If the >> > > scenario >> > > >> > > we're >> > > >> > > > >> trying >> > > >> > > > >> >> > to handle is *never* received a heartbeat, that's a >> > > separate >> > > >> > > > matter, >> > > >> > > > >> in >> > > >> > > > >> >> > that case unpausing upon receiving the first heartbeat >> > > would >> > > >> > make >> > > >> > > > >> sense, >> > > >> > > > >> >> > but it feels like that complicates things quite a bit >> > > (now we >> > > >> > > need >> > > >> > > > to >> > > >> > > > >> >> > differentiate between heartbeat #1 and hearbeat #N). >> > > >> > > > >> >> > >> > > >> > > > >> >> > On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner < >> > > >> > wfar...@apache.org >> > > >> > > > >> > > >> > > > >> wrote: >> > > >> > > > >> >> > >> > > >> > > > >> >> >> What is the guidance for deploying while the >> heartbeat >> > > >> service >> > > >> > > is >> > > >> > > > >> >> broken? >> > > >> > > > >> >> >> I think i know the answer, but it's important to >> spell >> > > out. >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> > Create a new coordinated job update in a paused >> > > >> > > > >> (ROLL_FORWARD_PAUSED) >> > > >> > > > >> >> >> > state to avoid any progress until the first >> heartbeat >> > > call >> > > >> > > > arrives. >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> I'm not sold on this being ultimately beneficial. In >> > the >> > > >> > worst >> > > >> > > > case, >> > > >> > > > >> >> >> impact is still limited by the health check >> threshold. >> > > >> Seems >> > > >> > > like >> > > >> > > > >> >> >> premature optimization at best, and an odd one if we >> > > proceed >> > > >> > > > without >> > > >> > > > >> a >> > > >> > > > >> >> >> 'NACK' signal via the heartbeatJobUpdate RPC. >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> Allow resuming of the paused-due-to-no-heartbeat >> update >> > > via >> > > >> a >> > > >> > > > >> >> >> > resumeJobUpdate call. >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> Are heartbeats required while rolling back? If so, >> > that >> > > >> might >> > > >> > > > impact >> > > >> > > > >> >> the >> > > >> > > > >> >> >> design here and in other places. >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> Allow resuming of the paused-due-to-no-heartbeat >> update >> > > via >> > > >> a >> > > >> > > > fresh >> > > >> > > > >> >> >> > heartbeatJobUpdate call. >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> The heratbeatJobUpdate RPC serves as an ACK, but we >> > don't >> > > >> > have a >> > > >> > > > >> NACK. >> > > >> > > > >> >> If >> > > >> > > > >> >> >> we are going to let lack-of-ACK serve as the NACK, i >> > > don't >> > > >> > think >> > > >> > > > it's >> > > >> > > > >> >> safe >> > > >> > > > >> >> >> to resume when we receive another ACK. In other >> > words, a >> > > >> > > service >> > > >> > > > >> >> toggling >> > > >> > > > >> >> >> unhealthy might not be deemed safe to proceed. >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> Perhaps just sending OK (or a NOOP equivalent) in >> case >> > > of a >> > > >> > > > >> user-paused >> > > >> > > > >> >> job >> > > >> > > > >> >> >> > update would make more sense as there is nothing >> > > >> monitoring >> > > >> > > > service >> > > >> > > > >> >> could >> > > >> > > > >> >> >> > do in that case. This should work fine with >> > > pause/resume >> > > >> > > > >> >> -aware/-agnostic >> > > >> > > > >> >> >> > monitoring service implementation. >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> This seems reasonable to me - heartbeats for a paused >> > > update >> > > >> > > > should >> > > >> > > > >> not >> > > >> > > > >> >> >> pose a risk, but can be safely ignored. >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> -=Bill >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko < >> > > >> > > > >> ma...@apache.org> >> > > >> > > > >> >> >> wrote: >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> > Agreed. That would be a logical generalization of >> the >> > > post >> > > >> > > > failover >> > > >> > > > >> >> >> > behavior. >> > > >> > > > >> >> >> > >> > > >> > > > >> >> >> > I have updated the above document with the >> following >> > > >> > changes: >> > > >> > > > >> >> >> > - Reply with PAUSED any time a job was paused by >> > user; >> > > >> > > > >> >> >> > - Start in paused state by default. >> > > >> > > > >> >> >> > >> > > >> > > > >> >> >> > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney < >> > > >> > > > >> kevi...@apache.org> >> > > >> > > > >> >> >> > wrote: >> > > >> > > > >> >> >> > > The doc mentioned that the scheduler will start >> an >> > > >> update >> > > >> > > > >> subject to >> > > >> > > > >> >> >> the >> > > >> > > > >> >> >> > > heartbeat countdown, and if it doesn't receive a >> > > >> heartbeat >> > > >> > > it >> > > >> > > > >> will >> > > >> > > > >> >> >> pause >> > > >> > > > >> >> >> > > the update. Why not start with the update >> > > >> > > > >> >> paused-due-to-no-heartbeat to >> > > >> > > > >> >> >> > > fail-fast any connectivity issues between the >> > service >> > > >> > > > providing >> > > >> > > > >> the >> > > >> > > > >> >> >> > > heartbeats and the scheduler? >> > > >> > > > >> >> >> > > >> > > >> > > > >> >> >> > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim >> > Khutornenko < >> > > >> > > > >> >> ma...@apache.org> >> > > >> > > > >> >> >> > > wrote: >> > > >> > > > >> >> >> > > >> > > >> > > > >> >> >> > >> Hi all, >> > > >> > > > >> >> >> > >> >> > > >> > > > >> >> >> > >> We are proposing a new feature for the scheduler >> > > >> updater, >> > > >> > > > which >> > > >> > > > >> you >> > > >> > > > >> >> >> > >> may find helpful. >> > > >> > > > >> >> >> > >> >> > > >> > > > >> >> >> > >> I have posed a brief feature summary here: >> > > >> > > > >> >> >> > >> >> > > >> > > > >> >> >> > >> >> > > >> > > > >> >> >> > >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> > > >> > > > >> >> > > >> > > > >> > > >> > > >> > > >> > >> > > >> >> > > >> > >> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md >> > > >> > > > >> >> >> > >> >> > > >> > > > >> >> >> > >> Please, reply with your >> > feedback/concerns/comments. >> > > >> > > > >> >> >> > >> >> > > >> > > > >> >> >> > >> Thanks, >> > > >> > > > >> >> >> > >> Maxim >> > > >> > > > >> >> >> > >> >> > > >> > > > >> >> >> > >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> > > >> > > > >> >> > > >> > > > >> > > >> > > >> > > >> > >> > > >> >> > > >> > >> >> >> >> -- >> Kevin Sweeney >> @kts >>