Folks,

I'm looking at improving availability for Aurora followers. Ideally we'd like to achieve two things:

1. Support active reads from Aurora followers.
2. Improve recovery time for Aurora master failover, which currently scales with the number of entries written to the replicated log since the latest snapshot.

We believe both goals could be addressed by enabling active reads from stand-by Aurora replicas and having them take periodic snapshots, similar to what the elected Aurora master does. Here is my recollection of what is going on; please correct it if it is lacking details or is simply wrong. I've also cc'ed the authors of the Mesos replicated log.

It seems the main reason an Aurora follower cannot be used for reads is that the Mesos replicated log does not guarantee that all log entries in the acceptor/learner (essentially a replicated Aurora read replica) have been "learned" in the Paxos sense. The interface that the Mesos replicated log provides allows range reads (from, to) and fails immediately if any replicated log value does not have the "learned" attribute set:

https://github.com/apache/mesos/blob/0d3793e94adcd6dc91d06404f205639cebd753fd/src/log/log.cpp#L421

Hence the design choice in Aurora to do full reads through the coordinator's Mesos replicated log, which apparently are guaranteed to return only learned entries:

https://github.com/apache/aurora/blob/b24619b28c4dbb35188871bacd0091a9e01218e3/src/main/java/org/apache/aurora/scheduler/log/mesos/MesosLog.java#L227

There is an open ticket in Mesos for providing streaming support in the Mesos replicated log:

https://issues.apache.org/jira/browse/MESOS-1944

It seems this feature, when implemented, could resolve most if not all of the concerns that are blocking Aurora followers from being used for reads. I don't understand this paragraph though:

"If an unlearned position is encountered, there are couple of options. One option is to wait until it gets learned. However, it's likely that that position never gets learned. We also don't wanna blindly do active read (full paxos round) because it will possibly demote the leader. One solution is to wait for the next learned event and then do active read for the positions in-between."

Why would performing an active read for the positions in between, once a new learned position is available, not cause coordinator demotion?
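
To make the question concrete, here is a toy Python model of the two read behaviors as I understand them. It is only a sketch: `Entry`, `strict_range_read`, `gaps_behind_learned` and `follower_log` are hypothetical names and none of this is Mesos or Aurora code. It shows a range read that fails on the first unlearned position, and a helper that picks out the "positions in-between" that sit behind a later learned position. It does not model Paxos rounds or coordinator demotion.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entry:
    value: str
    learned: bool  # True once the replica knows the value was chosen


class UnlearnedPosition(Exception):
    """Raised when a range read hits a position that is not learned."""


def strict_range_read(log: Dict[int, Entry], start: int, end: int) -> List[str]:
    """Read positions start..end (inclusive); fail on the first unlearned one."""
    values = []
    for pos in range(start, end + 1):
        entry = log.get(pos)
        if entry is None or not entry.learned:
            raise UnlearnedPosition(pos)
        values.append(entry.value)
    return values


def gaps_behind_learned(log: Dict[int, Entry], start: int) -> List[int]:
    """Unlearned positions >= start that sit below some later learned position.

    These correspond to the "positions in-between" from the ticket.
    Unlearned positions above the highest learned one are left alone
    (in this toy model the reader would just wait for the next learned event).
    """
    learned = [p for p, e in log.items() if e.learned and p >= start]
    if not learned:
        return []
    highest = max(learned)
    return [p for p in range(start, highest)
            if p not in log or not log[p].learned]


# Hypothetical follower state: position 2 is a hole, position 4 is the tail.
follower_log = {
    0: Entry("a", True),
    1: Entry("b", True),
    2: Entry("c", False),
    3: Entry("d", True),
    4: Entry("e", False),
}
```

With this state, `strict_range_read(follower_log, 0, 3)` raises at position 2, while `gaps_behind_learned(follower_log, 0)` returns `[2]` and leaves the tail position 4 alone.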

It looks like, in principle, the same semantics could be supported for the Mesos log's range reads without changing the interface (I understand this may lead to unpredictable latencies).

Thoughts?

-Igor