This seems like a very good idea to explore through prototyping; I am +1 on this as well.
On Wed, Jul 26, 2017 at 4:22 PM, Jordan Ly <jordan....@gmail.com> wrote:
> Thanks for the comments everyone!
>
> Bill definitely brings up some good points. I've added additional data
> to the document in order to better substantiate the claim.
>
> My original graph was using an incorrect query that did not specify
> the correct snapshot_apply time. My new graph gives a little more
> insight into what the time in 'scheduler_log_recover_nanos_total' was
> spent doing (applying the snapshot, actually reading from leveldb, and
> some time not captured by metrics). Additionally, I've added actual
> logs showing what happens from Mesos disconnecting from the framework
> up to the new leader reconnecting to Mesos. The other data points we
> have from other failovers are consistent with this one case. Thus,
> according to the proposal, by keeping a follower's log and volatile
> store up to date, we would be able to both 1) eliminate the time it
> takes to apply the snapshot during the actual failover and 2) reduce
> the amount of time spent replaying individual log entries (we only
> need to replay from the last time a catch-up was triggered).
>
> Echoing what David said, the implementation details will follow once
> we ensure this is a reasonable plan and a good use of effort.
>
> On Wed, Jul 26, 2017 at 3:25 PM, David McLaughlin <dmclaugh...@apache.org> wrote:
> > One thing we should make clear: we have already built a working
> > prototype of the 'catch-up' logic in the replicated log. The next
> > step was to take this functionality and make use of it in Aurora as
> > a proof of concept before upstreaming it. The main "threads" we're
> > trying to explore are:
> >
> > 1) Reducing unplanned failovers (and API timeouts) due to
> > stop-the-world GC pauses.
> > 2) Reducing write unavailability due to write-lock contention (e.g.
> > 40s snapshot times leading to API timeouts every hour).
> > 3) Reducing the cost of a failover by speeding up the leader
> > recovery time.
> >
> > The proposal here is obviously targeted at (3), whereas my patches
> > for snapshot deduplication and the snapshot creation proposal were
> > aimed more at (2). The big idea we had for (1) was moving snapshots
> > (and backups) into followers, which would obviously require Jordan's
> > proposal here to be shipped first.
> >
> > It wasn't clear to me how difficult this would be to add to the
> > Scheduler, so I wanted to make sure we shared our intentions before
> > investing too much effort, in case there was either some fundamental
> > flaw in the approach or some easier win.
> >
> > On Wed, Jul 26, 2017 at 12:03 PM, Bill Farner <wfar...@apache.org> wrote:
> > > Some (hopefully) constructive criticism:
> > >
> > > - The doc is very high-level on both the problem statement and the
> > > proposal, making it difficult to agree with prioritizing this over
> > > cheaper snapshots or the oft-discussed support for an external DBMS.
> > >
> > > - The supporting data is a single data point of the
> > > scheduler_log_recover_nanos_total metric. More data points and more
> > > detail on this data (how many entries/bytes did it represent?)
> > > would help normalize the metric, and possibly indicate whether
> > > recovery time is linear or non-linear. Finer-grained information
> > > would also help (where was time spent within the replay - GC?
> > > reading log entries? inflating snapshots?).
> > >
> > > - The doc calls out parts (1) Mesos log support and (2) scheduler
> > > support. Is the planned approach to gain value from (1) before (2),
> > > or are both needed?
> > >
> > > - For (2) scheduler support, can you add detail on the
> > > implementation? Much of the scheduler code assumes it is the leader
> > > (CallOrderEnforcingStorage is currently a gatekeeper to avoid
> > > mistakes of this type), so I would caution against replaying
> > > directly into the main Storage.
> > >
> > > On Wed, Jul 26, 2017 at 1:56 PM, Santhosh Kumar Shanmugham <sshanmug...@twitter.com.invalid> wrote:
> > > > +1
> > > >
> > > > This sets the stage for more potential benefits by offloading
> > > > work that consumes stable data (i.e. data not affected by minor
> > > > inconsistencies) from the leading scheduler.
> > > >
> > > > On Wed, Jul 26, 2017 at 10:31 AM, David McLaughlin <dmclaugh...@apache.org> wrote:
> > > > > I'm +1 to this approach over my proposal. With the enforced
> > > > > daily failover, it's a much bigger win to make failovers
> > > > > "cheap" than to make snapshots cheap, and this is going to be
> > > > > backwards compatible too.
> > > > >
> > > > > On Wed, Jul 26, 2017 at 9:51 AM, Jordan Ly <jordan....@gmail.com> wrote:
> > > > > > Hello everyone!
> > > > > >
> > > > > > I've created a document with an initial proposal to reduce
> > > > > > leader failover time by eagerly reading and replaying the
> > > > > > replicated log in followers:
> > > > > >
> > > > > > https://docs.google.com/document/d/10SYOq0ehLMFKQ9rX2TGC_xpM--GBnstzMFP-tXGQaVI/edit?usp=sharing
> > > > > >
> > > > > > We wanted to open this topic up for discussion with the
> > > > > > community and see if anyone had any alternate opinions or
> > > > > > recommendations before starting the work.
> > > > > >
> > > > > > If this solution seems reasonable, we will write and release
> > > > > > a design document for a more formal discussion and review.
> > > > > >
> > > > > > Please feel free to comment on the doc, or let me know if
> > > > > > you have any concerns.
> > > > > >
> > > > > > -Jordan
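For concreteness, a minimal Java sketch of the follower catch-up loop the
thread describes. All names here (FollowerCatchup, LogReader, VolatileStore)
are hypothetical stand-ins, not actual Aurora or Mesos replicated-log APIs:
the follower starts from its applied snapshot position and periodically
applies new log entries into a follower-local store, so that a failover only
has to replay the entries written after the last catch-up pass.

import java.util.Iterator;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Follower-side catch-up: periodically apply new replicated-log entries
 * to a local store so that, on failover, the new leader skips the
 * snapshot apply and replays only the entries since the last pass.
 */
final class FollowerCatchup {
  /** Hypothetical reader over the replicated log. */
  interface LogReader {
    /** Returns entries at positions >= from, in order. */
    Iterator<byte[]> readFrom(long from);
  }

  /** Hypothetical follower-local store, kept separate from the main Storage. */
  interface VolatileStore {
    void apply(byte[] logEntry);
  }

  private final LogReader log;
  private final VolatileStore store;
  private long lastApplied;  // position of the last locally applied entry

  FollowerCatchup(LogReader log, VolatileStore store, long snapshotPosition) {
    this.log = log;
    this.store = store;
    // Assume the follower has already applied a snapshot up to this position.
    this.lastApplied = snapshotPosition;
  }

  /** One catch-up pass: apply everything written since the last pass. */
  synchronized void catchUp() {
    Iterator<byte[]> entries = log.readFrom(lastApplied + 1);
    while (entries.hasNext()) {
      store.apply(entries.next());
      lastApplied++;  // assumes dense, monotonically increasing positions
    }
  }

  /** Run catch-up on a fixed delay while this scheduler is a follower. */
  ScheduledExecutorService start(long delaySecs) {
    ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor();
    executor.scheduleWithFixedDelay(
        this::catchUp, delaySecs, delaySecs, TimeUnit.SECONDS);
    return executor;
  }
}

In line with Bill's caution about CallOrderEnforcingStorage, the sketch
replays into a follower-local store rather than the main Storage; on winning
leadership, the scheduler would promote or copy that store instead of
re-applying the snapshot and replaying the full log.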