Thanks for bringing this up, Igor. I agree, the Aurora availability story is long overdue for a second look. Using an external SQL instance as the longer-term storage solution has been our preferred way of thinking about this problem. However, it is unlikely to materialize any time soon, primarily due to the almost intractable scale issues of the current TaskStore implementation.
I am very open to any ideas and happy to hear from the Mesos folks on what's possible here. Also, if I am not mistaken, Dmitriy has explored the idea of online native log compaction that would not require a full failover. That seemed like something attainable in the short term without requiring deep lifecycle and architectural changes.

Jie, would you happen to have any pointers on the possibility of online compaction in the Mesos native log? Is this something that already exists but is not exposed through a public API?

On Mon, Aug 29, 2016 at 5:00 PM, Igor Morozov <igm...@gmail.com> wrote:

> Folks,
>
> I'm looking at improving availability for aurora followers meaning ideally
> we'd like to achieve:
> 1. support active reads from aurora followers
> 2. improve recovery time for aurora master failover that scales now with
> the number of log entries in replicated log since the latest snapshot.
>
> We believe both goals could be addressed by enabling active reads from
> stand-by aurora replicas and doing periodic snapshots similar to what
> elected aurora master does.
>
> Here is my recollection of what is going on, please correct it if it is
> lacking details or is simply wrong. I also cc-ed the authors of mesos
> replicated log:
>
> So it seems the main reason aurora follower could not be used for reads is
> because mesos replicated log does not guarantee that all log entries in the
> acceptor/learner(essentially aurora replicated read replica) have been
> "learned" in Paxos sense. The interface that mesos replicated log provides
> allows range reads (from, to) and fails immediately if any replicated log
> value does not have "learned" attribute set:
> https://github.com/apache/mesos/blob/0d3793e94adcd6dc91d06404f205639cebd753fd/src/log/log.cpp#L421
>
> Hence the design choice in aurora to use full reads from coordinator's
> mesos replicated log that apparently are guaranteed to be learned:
>
> https://github.com/apache/aurora/blob/b24619b28c4dbb35188871bacd0091a9e01218e3/src/main/java/org/apache/aurora/scheduler/log/mesos/MesosLog.java#L227
>
> There is an open ticket in mesos for providing streaming support in mesos
> replicated log:
> https://issues.apache.org/jira/browse/MESOS-1944
>
> It seems this feature when implemented can solve most if not all concerns
> that are blocking aurora follower from being used for reads.
>
> I don't understand this paragraph though:
>
> "If an unlearned position is encountered, there are couple of options. One
> option is to wait until it gets learned. However, it's likely that that
> position never gets learned. We also don't wanna blindly do active read
> (full paxos round) because it will possibly demote the leader. One solution
> is to wait for the next learned event and then do active read for the
> positions in-between."
>
> Why would performing an active read for positions in-between while there is
> new learned position available will not cause coordinator demotion?
>
> It looks like in principle the same semantic could be supported for mesos
> log's range reads without changing its interface (I understand it may lead
> to unpredictable latencies).
>
> Thoughts?
> -Igor
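
For anyone following along, here is a toy model of the read behavior discussed in the quoted message. It is only a sketch and not the Mesos API: `ToyReplica`, `read_strict`, `unlearned_gaps`, and the inclusive range convention are all hypothetical names and assumptions made for illustration. It shows a range read that fails on the first unlearned position, and a helper that lists the unlearned positions sitting below the latest learned one, which is my reading of the "positions in-between" in the MESOS-1944 paragraph.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entry:
    value: str
    learned: bool  # True once this position is known to be agreed upon


class UnlearnedPositionError(Exception):
    """Raised when a range read hits a missing or unlearned position."""


class ToyReplica:
    """Hypothetical stand-in for a follower's local copy of the log."""

    def __init__(self) -> None:
        self._entries: Dict[int, Entry] = {}

    def put(self, position: int, value: str, learned: bool) -> None:
        self._entries[position] = Entry(value, learned)

    def read_strict(self, start: int, end: int) -> List[str]:
        """Read [start, end] (assumed inclusive); fail on any unlearned position."""
        values = []
        for pos in range(start, end + 1):
            entry = self._entries.get(pos)
            if entry is None or not entry.learned:
                raise UnlearnedPositionError(f"position {pos} is not learned")
            values.append(entry.value)
        return values

    def unlearned_gaps(self) -> List[int]:
        """Unlearned or missing positions below the latest learned position."""
        learned = [p for p, e in self._entries.items() if e.learned]
        if not learned:
            return []
        latest = max(learned)
        first = min(self._entries)
        return [
            p
            for p in range(first, latest)
            if p not in self._entries or not self._entries[p].learned
        ]
```

With positions 1 and 3 learned and 2 and 4 unlearned, `read_strict(1, 3)` raises, and `unlearned_gaps()` reports only position 2, since position 4 lies above the latest learned position. Whether filling such gaps can be done without demoting the coordinator is exactly the open question above.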