Hi everyone,

Following up on the discussion here:
https://lists.apache.org/thread.html/e31d7dbcb054ed570f969ae2043eadfc090383edfe0751cec59b29d3@%3Cdev.aurora.apache.org%3E

I've created a design document detailing the implementation of a "hot standby" mechanism in which scheduler followers eagerly read and apply entries from the replicated log. The goal is that, in the event of a failover, the newly elected leader has fewer entries to replay to rebuild its state and can therefore start serving traffic sooner.

https://docs.google.com/document/d/1DOtKA4-vrtxat1MaUYMQ6Y1iXhA8ob6Mfztzt-R1Oss/edit?usp=sharing

I have a working prototype of this design running on a test cluster. Please feel free to comment on the doc!

The document references a current proposal in Mesos by Ilya Pronin:
https://lists.apache.org/thread.html/1b8fd10e151054a85c9ea3dc808f7fecb9a87fe5f5e87b10caa46e2a@%3Cdev.mesos.apache.org%3E

Cheers,
Jordan Ly