This seems like a very good idea to explore through prototyping; I am +1 on this as well.
On Wed, Jul 26, 2017 at 4:22 PM, Jordan Ly <jordan....@gmail.com> wrote:
> Thanks for the comments everyone!
>
> Bill definitely brings up some good points. I've added additional data
> to the document in order to better substantiate the claim.
>
> My original graph was using an incorrect query that did not specify
> the correct snapshot_apply time. My new graph gives a little more
> insight into what the time in 'scheduler_log_recover_nanos_total' was
> spent doing (applying the snapshot, actually reading from leveldb, and
> some time not captured by metrics). Additionally, I've added actual
> logs showing what happens from Mesos disconnecting from the framework
> up to the new leader reconnecting to Mesos. The other data points we
> have from other failovers are consistent with this one case. Thus,
> according to the proposal, by keeping a follower's log and volatile
> store up to date, we would be able to both 1) eliminate the time it
> takes to apply the snapshot during the actual failover and 2) reduce
> the amount of time spent replaying individual log entries (we only
> need to replay from the last time a catch-up was triggered).
>
> Echoing what David said, the implementation details will follow once
> we ensure this is a reasonable plan and a good use of effort.
>
> On Wed, Jul 26, 2017 at 3:25 PM, David McLaughlin <dmclaugh...@apache.org> wrote:
> > One thing we should make clear: we have already built a working
> > prototype of the 'catch-up' logic in the replicated log. The next
> > step was to take this functionality and make use of it in Aurora as
> > a proof of concept before upstreaming it. The main "threads" we're
> > trying to explore are:
> >
> > 1) Reducing unplanned failovers (and API timeouts) due to
> > stop-the-world GC pauses.
> > 2) Reducing write unavailability due to write-lock contention (e.g.
> > 40s snapshot times leading to API timeouts every hour).
> > 3) Reducing the cost of a failover by speeding up the leader
> > recovery time.
> >
> > The proposal here is obviously targeted at (3), whereas my patches
> > for snapshot deduplication and the snapshot creation proposal were
> > aimed more at (2). The big idea we had for (1) was moving snapshots
> > (and backups) into followers, which would obviously require Jordan's
> > proposal here to be shipped first.
> >
> > It wasn't clear to me how difficult this would be to add to the
> > Scheduler, so I wanted to make sure we shared our intentions before
> > investing too much effort, in case there was either some fundamental
> > flaw in the approach or some easier win.
> >
> > On Wed, Jul 26, 2017 at 12:03 PM, Bill Farner <wfar...@apache.org> wrote:
> > > Some (hopefully) constructive criticism:
> > >
> > > - The doc is very high-level on both the problem statement and the
> > > proposal, making it difficult to agree with prioritizing this over
> > > cheaper snapshots or the oft-discussed support for an external DBMS.
> > >
> > > - The supporting data is a single data point of the
> > > scheduler_log_recover_nanos_total metric. More data points and more
> > > detail on this data (how many entries/bytes did it represent?)
> > > would help normalize the metric, and possibly indicate whether
> > > recovery time is linear or non-linear. Finer-grained information
> > > would also help (where was time spent within the replay - GC?
> > > reading log entries? inflating snapshots?).
> > >
> > > - The doc calls out parts (1) Mesos log support and (2) scheduler
> > > support. Is the planned approach to gain value from (1) before (2),
> > > or are both needed?
> > >
> > > - For (2) scheduler support, can you add detail on the
> > > implementation? Much of the scheduler code assumes it is the leader
> > > (CallOrderEnforcingStorage is currently a gatekeeper to avoid
> > > mistakes of this type), so I would caution against replaying
> > > directly into the main Storage.
> > >
> > > On Wed, Jul 26, 2017 at 1:56 PM, Santhosh Kumar Shanmugham <sshanmug...@twitter.com.invalid> wrote:
> > > > +1
> > > >
> > > > This sets the stage for more potential benefits by offloading
> > > > work that consumes stable data (i.e. data not affected by minor
> > > > inconsistencies) from the leading scheduler.
> > > >
> > > > On Wed, Jul 26, 2017 at 10:31 AM, David McLaughlin <dmclaugh...@apache.org> wrote:
> > > > > I'm +1 to this approach over my proposal. With the enforced
> > > > > daily failover, it's a much bigger win to make failovers
> > > > > "cheap" than to make snapshots cheap, and this is going to be
> > > > > backwards compatible too.
> > > > >
> > > > > On Wed, Jul 26, 2017 at 9:51 AM, Jordan Ly <jordan....@gmail.com> wrote:
> > > > > > Hello everyone!
> > > > > >
> > > > > > I've created a document with an initial proposal to reduce
> > > > > > leader failover time by eagerly reading and replaying the
> > > > > > replicated log in followers:
> > > > > >
> > > > > > https://docs.google.com/document/d/10SYOq0ehLMFKQ9rX2TGC_xpM--GBnstzMFP-tXGQaVI/edit?usp=sharing
> > > > > >
> > > > > > We wanted to open this topic up for discussion with the
> > > > > > community and see if anyone had any alternate opinions or
> > > > > > recommendations before starting the work.
> > > > > >
> > > > > > If this solution seems reasonable, we will write and release
> > > > > > a design document for a more formal discussion and review.
> > > > > >
> > > > > > Please feel free to comment on the doc, or let me know if
> > > > > > you have any concerns.
> > > > > >
> > > > > > -Jordan
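For concreteness, a minimal Java sketch of the follower catch-up loop the
thread describes. All names here (FollowerCatchup, LogReader, VolatileStore)
are hypothetical stand-ins, not actual Aurora or Mesos replicated-log APIs:
the follower starts from its applied snapshot position and periodically
applies new log entries into a follower-local store, so that a failover only
has to replay the entries written after the last catch-up pass.

import java.util.Iterator;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Follower-side catch-up: periodically apply new replicated-log entries
 * to a local store so that, on failover, the new leader skips the
 * snapshot apply and replays only the entries since the last pass.
 */
final class FollowerCatchup {
  /** Hypothetical reader over the replicated log. */
  interface LogReader {
    /** Returns entries at positions >= from, in order. */
    Iterator<byte[]> readFrom(long from);
  }

  /** Hypothetical follower-local store, kept separate from the main Storage. */
  interface VolatileStore {
    void apply(byte[] logEntry);
  }

  private final LogReader log;
  private final VolatileStore store;
  private long lastApplied;  // position of the last locally applied entry

  FollowerCatchup(LogReader log, VolatileStore store, long snapshotPosition) {
    this.log = log;
    this.store = store;
    // Assume the follower has already applied a snapshot up to this position.
    this.lastApplied = snapshotPosition;
  }

  /** One catch-up pass: apply everything written since the last pass. */
  synchronized void catchUp() {
    Iterator<byte[]> entries = log.readFrom(lastApplied + 1);
    while (entries.hasNext()) {
      store.apply(entries.next());
      lastApplied++;  // assumes dense, monotonically increasing positions
    }
  }

  /** Run catch-up on a fixed delay while this scheduler is a follower. */
  ScheduledExecutorService start(long delaySecs) {
    ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor();
    executor.scheduleWithFixedDelay(
        this::catchUp, delaySecs, delaySecs, TimeUnit.SECONDS);
    return executor;
  }
}

In line with Bill's caution about CallOrderEnforcingStorage, the sketch
replays into a follower-local store rather than the main Storage; on winning
leadership, the scheduler would promote or copy that store instead of
re-applying the snapshot and replaying the full log.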