Repository: mesos
Updated Branches:
  refs/heads/master 9a2c10d7d -> 454cdf42d


Documented how the replicated log works.

This is closely based on an (unpublished) blog post by Jie Yu.

Review: https://reviews.apache.org/r/43712


Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/454cdf42
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/454cdf42
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/454cdf42

Branch: refs/heads/master
Commit: 454cdf42d6c9d3387391665e1f72594c48838911
Parents: 9a2c10d
Author: Neil Conway <[email protected]>
Authored: Fri Feb 5 14:14:37 2016 -0800
Committer: Jie Yu <[email protected]>
Committed: Thu Feb 18 11:14:37 2016 -0800

----------------------------------------------------------------------
 docs/configuration.md                     |   4 +-
 docs/high-availability-framework-guide.md |  10 +--
 docs/home.md                              |   1 +
 docs/images/log-architecture.png          | Bin 0 -> 21105 bytes
 docs/images/log-cluster.png               | Bin 0 -> 14415 bytes
 docs/maintenance.md                       |   4 +-
 docs/operational-guide.md                 |   2 +-
 docs/replicated-log-internals.md          | 117 +++++++++++++++++++++++++
 8 files changed, 128 insertions(+), 10 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mesos/blob/454cdf42/docs/configuration.md
----------------------------------------------------------------------
diff --git a/docs/configuration.md b/docs/configuration.md
index 801472c..b04e873 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -547,8 +547,8 @@ Currently there is no support for multiple HTTP 
authenticators. (default: basic)
     --[no-]log_auto_initialize
   </td>
   <td>
-Whether to automatically initialize the replicated log used for the
-registry. If this is set to false, the log has to be manually
+Whether to automatically initialize the [replicated 
log](replicated-log-internals.md)
+used for the registry. If this is set to false, the log has to be manually
 initialized when used for the very first time. (default: true)
   </td>
 </tr>

http://git-wip-us.apache.org/repos/asf/mesos/blob/454cdf42/docs/high-availability-framework-guide.md
----------------------------------------------------------------------
diff --git a/docs/high-availability-framework-guide.md 
b/docs/high-availability-framework-guide.md
index f21f95f..0d9c483 100644
--- a/docs/high-availability-framework-guide.md
+++ b/docs/high-availability-framework-guide.md
@@ -57,9 +57,9 @@ availability:
     accordingly.
 
   * Mesos actually provides ordered (but unreliable) message delivery between
-    any two pair of processes: for example, if a framework sends messages M1 
and
-    M2 to the master, the master might receive no messages, just M1, just M2, 
or
-    M1 followed by M2 -- it will _not_ receive M2 followed by M1.
+    any pair of processes: for example, if a framework sends messages M1 and M2
+    to the master, the master might receive no messages, just M1, just M2, or 
M1
+    followed by M2 -- it will _not_ receive M2 followed by M1.
 
   * As a convenience for framework authors, Mesos provides reliable delivery of
     task status updates. The agent persists task status updates to disk and 
then
@@ -136,7 +136,7 @@ Highly available framework designs typically follow a few 
common patterns:
    and pending tasks. In fact, the same coordination service that is used for
    leader election (such as ZooKeeper or etcd) can often be used for this
    purpose. Some Mesos frameworks (such as Apache Aurora) use the Mesos
-   replicated log for this purpose.
+   [replicated log](replicated-log-internals.md) for this purpose.
 
    * The data store should be used to record the actions that the scheduler
      _intends_ to take, before it takes them. For example, if a scheduler
@@ -262,7 +262,7 @@ it from the cluster. Specifically:
   task that was running on a removed agent.
 
     >NOTE: Neither the callback nor the updates are reliably delivered by the
-    master. For example if the master or scheduler fails over or there is a
+    master. For example, if the master or scheduler fails over or there is a
     network connectivity issue during the delivery of these messages, they will
     not be resent.
 

http://git-wip-us.apache.org/repos/asf/mesos/blob/454cdf42/docs/home.md
----------------------------------------------------------------------
diff --git a/docs/home.md b/docs/home.md
index 982ad28..07214b9 100644
--- a/docs/home.md
+++ b/docs/home.md
@@ -45,6 +45,7 @@ layout: documentation
 * [Multiple Disks](multiple-disk.md) for how to to allow tasks to use multiple 
isolated disk resources.
 * [Quota](quota.md) for how to configure Mesos to provide guaranteed resource 
allocations for use by a role.
 * [Reservation](reservation.md) for how operators and frameworks can reserve 
resources on individual agents for use by a role.
+* [Replicated Log](replicated-log-internals.md) for information on the Mesos 
replicated log.
 
 ## Running Mesos Frameworks
 

http://git-wip-us.apache.org/repos/asf/mesos/blob/454cdf42/docs/images/log-architecture.png
----------------------------------------------------------------------
diff --git a/docs/images/log-architecture.png b/docs/images/log-architecture.png
new file mode 100644
index 0000000..34c57f1
Binary files /dev/null and b/docs/images/log-architecture.png differ

http://git-wip-us.apache.org/repos/asf/mesos/blob/454cdf42/docs/images/log-cluster.png
----------------------------------------------------------------------
diff --git a/docs/images/log-cluster.png b/docs/images/log-cluster.png
new file mode 100644
index 0000000..62042d2
Binary files /dev/null and b/docs/images/log-cluster.png differ

http://git-wip-us.apache.org/repos/asf/mesos/blob/454cdf42/docs/maintenance.md
----------------------------------------------------------------------
diff --git a/docs/maintenance.md b/docs/maintenance.md
index e6bfe0f..4d24ec6 100644
--- a/docs/maintenance.md
+++ b/docs/maintenance.md
@@ -155,8 +155,8 @@ To cancel a maintenance schedule, the operator should post 
an empty schedule.
 
 As soon as a schedule is posted to the Mesos master, the following things 
occur:
 
-* The schedule is stored in the replicated log.  This means
-  the schedule is persisted in case of master failover.
+* The schedule is stored in the [replicated log](replicated-log-internals.md).
+  This means the schedule is persisted in case of master failover.
 * All machines in the schedule are immediately transitioned into Draining
   mode.  The mode of each machine is also persisted in the replicated log.
 * All frameworks using resources on affected agents are immediately

http://git-wip-us.apache.org/repos/asf/mesos/blob/454cdf42/docs/operational-guide.md
----------------------------------------------------------------------
diff --git a/docs/operational-guide.md b/docs/operational-guide.md
index 4680ee3..a4d6710 100644
--- a/docs/operational-guide.md
+++ b/docs/operational-guide.md
@@ -6,7 +6,7 @@ Mesos uses a 
"[fail-fast](https://en.wikipedia.org/wiki/Fail-fast)" approach to
 To ensure that such failures are handled appropriately, production deployments 
of Mesos typically use a _process supervisor_ (such as systemd or supervisord) 
to detect when Mesos processes exit. The supervisor can be configured to 
restart the failed process automatically and/or to notify the cluster operator 
to investigate the situation.
 
 ## Changing the master quorum
-The master leverages a Paxos-based replicated log as its storage backend 
(`--registry=replicated_log` is the only storage backend currently supported). 
Each master participates in the ensemble as a log replica. The `--quorum` flag 
determines a majority of the masters.
+The master leverages a [Paxos-based replicated 
log](replicated-log-internals.md) as its storage backend 
(`--registry=replicated_log` is the only storage backend currently supported). 
Each master participates in the ensemble as a log replica. The `--quorum` flag 
determines a majority of the masters.
 
 The following table shows the tolerance to master failures for each quorum 
size:
 

http://git-wip-us.apache.org/repos/asf/mesos/blob/454cdf42/docs/replicated-log-internals.md
----------------------------------------------------------------------
diff --git a/docs/replicated-log-internals.md b/docs/replicated-log-internals.md
new file mode 100644
index 0000000..4f379a3
--- /dev/null
+++ b/docs/replicated-log-internals.md
@@ -0,0 +1,117 @@
+---
+layout: documentation
+---
+
+# The Mesos Replicated Log
+
+Mesos provides a library that lets you create replicated fault-tolerant 
append-only logs; this library is known as the _replicated log_. The Mesos 
master uses this library to store cluster state in a replicated, durable way; 
the library is also available for use by frameworks to store replicated 
framework state or to implement the common "[replicated state 
machine](https://en.wikipedia.org/wiki/State_machine_replication)" pattern.
+
+## What is the replicated log?
+
+![Aurora and the Replicated Log](images/log-cluster.png)
+
+The replicated log provides _append-only_ storage of _log entries_; each log 
entry can contain arbitrary data. The log is _replicated_, which means that 
each log entry has multiple copies in the system. Replication provides both 
fault tolerance and high availability. In the following example, we use [Apache 
Aurora](https://aurora.apache.org/), a fault tolerant scheduler (i.e., 
framework) running on top of Mesos, to show a typical replicated log setup.
+
+As shown above, there are multiple Aurora instances running simultaneously 
(for high availability), with one elected as the leader. There is a log replica 
on each host running Aurora. Aurora can access the replicated log through a 
thin library containing the log API.
+
+Typically, the leader is the only one that appends data to the log. Each log 
entry is replicated and sent to all replicas in the system. Replicas are 
strongly consistent. In other words, all replicas agree on the value of each 
log entry. Because the log is replicated, when Aurora decides to failover, it 
does not need to copy the log from a remote host.
+
+
+## Use Cases
+
+The replicated log can be used to build a wide variety of distributed 
applications. For example, Aurora uses the replicated log to store all task 
states and job configurations. The Mesos master's _registry_ also leverages the 
replicated log to store information about all slaves in the cluster.
+
+The replicated log is often used to allow applications to manage replicated 
state in a strongly consistent way. One way to do this is to store a 
state-mutating operation in each log entry and have all instances of the 
distributed application agree on the same initial state (e.g., empty state). 
The replicated log ensures that each application instance will observe the same 
sequence of log entries in the same order; as long as applying a state-mutating 
operation is deterministic, this ensures that all application instances will 
remain consistent with one another. If any instance of the application crashes, 
it can reconstruct the current version of the replicated state by starting at 
the initial state and re-applying all the logged mutations in order.
+
+If the log grows too large, an application can write out a snapshot and then 
delete all the log entries that occurred before the snapshot. Using this 
approach, we will be exposing a [distributed 
state](https://github.com/apache/mesos/blob/master/src/state/state.hpp) 
abstraction in Mesos with replicated log as a backend.
+
+Similarly, the replicated log can be used to build [replicated state 
machines](https://en.wikipedia.org/wiki/State_machine_replication). In this 
scenario, each log entry contains a state machine command. Since replicas are 
strongly consistent, all servers will execute the same commands in the same 
order.
+
+## Implementation
+
+![Replicated Log Architecture](images/log-architecture.png)
+
+The replicated log uses the [Paxos consensus 
algorithm](https://en.wikipedia.org/wiki/Paxos_%28computer_science%29) to 
ensure that all replicas agree on every log entry’s value. It is similar to 
what’s described in [these 
slides](https://ramcloud.stanford.edu/~ongaro/userstudy/paxos.pdf). Readers who 
are familiar with Paxos can skip this section.
+
+The above figure is an implementation overview. When a user wants to append 
data to the log, the system creates a log writer. The log writer internally 
creates a coordinator. The coordinator contacts all replicas and executes the 
Paxos algorithm to make sure all replicas agree about the appended data. The 
coordinator is sometimes referred to as the 
[_proposer_](https://en.wikipedia.org/wiki/Paxos_%28computer_science%29).
+
+Each replica keeps an array of log entries. The array index is the log 
position. Each log entry is composed of three components: the value written by 
the user, the associated Paxos state and a _learned_ bit where true means this 
log entry’s value has been agreed. Therefore, a replica in our implementation 
is both an 
[_acceptor_](https://en.wikipedia.org/wiki/Paxos_%28computer_science%29) and a 
[_learner_](https://en.wikipedia.org/wiki/Paxos_%28computer_science%29).
+
+### Reaching consensus for a single log entry
+
+A Paxos round can help all replicas reach consensus on a single log entry’s 
value. It has two phases: a promise phase and a write phase. Note that we are 
using slightly different terminology from the [original Paxos 
paper](https://research.microsoft.com/en-us/um/people/lamport/pubs/paxes-simple.pdf).
 In our implementation, the _prepare_ and _accept_ phases in the original paper 
are referred to as the _promise_ and _write_ phases, respectively. 
Consequently, a prepare request (response) is referred to as a promise request 
(response), and an accept request (response) is referred to as a write request 
(response).
+
+To append value _X_ to the log at position _p_, the coordinator first 
broadcasts a promise request to all replicas with proposal number _n_, asking 
replicas to promise that they will not respond to any request (promise/write 
request) with a proposal number lower than _n_. We assume that _n_ is higher 
than any other previously used proposal number, and will explain how we do this 
later.
+
+When receiving the promise request, each replica checks its Paxos state to 
decide if it can safely respond to the request, depending on the promises it 
has previously given out. If the replica is able to give the promise (i.e., 
passes the proposal number check), it will first persist its promise (the 
proposal number _n_) on disk and reply with a promise response. If the replica 
has been previously written (i.e., accepted a write request), it needs to 
include the previously written value along with the proposal number used in 
that write request into the promise response it’s about to send out.
+
+Upon receiving promise responses from a 
[quorum](https://en.wikipedia.org/wiki/Quorum_%28distributed_computing%29) of 
replicas, the coordinator first checks if there exist any previously written 
value from those responses. The append operation cannot continue if a 
previously written value is found because it’s likely that a value has 
already been agreed on for that log entry. This is one of the key ideas in 
Paxos: restrict the value that can be written to ensure consistency.
+
+If no previous written value is found, the coordinator broadcasts a write 
request to all replicas with value _X_ and proposal number _n_. On receiving 
the write request, each replica checks the promise it has given again, and 
replies with a write response if the write request’s proposal number is equal 
to or larger than the proposal number it has promised. Once the coordinator 
receives write responses from a quorum of replicas, the append operation 
succeeds.
+
+### Optimizing append latency using Multi-Paxos
+
+One naive solution to implement a replicated log is to run a full Paxos round 
(promise phase and write phase) for each log entry. As discussed in the 
[original Paxos 
paper](https://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf),
 if the leader is relatively stable, _Multi-Paxos_ can be used to eliminate the 
need for the promise phase for most of the append operations, resulting in 
improved performance.
+
+To do that, we introduce a new type of promise request called an _implicit_ 
promise request. An implicit promise request can be viewed as a _batched_ 
promise request for a (potentially infinite) set of log entries. Broadcasting 
an implicit promise request is conceptually equivalent to broadcasting a 
promise request for every log entry whose value has not yet been agreed. If the 
implicit promise request broadcasted by a coordinator gets accepted by a quorum 
of replicas, this coordinator is no longer required to run the promise phase if 
it wants to append to a log entry whose value has not yet been agreed because 
the promise phase has already been done in _batch_. The coordinator in this 
case is therefore called _elected_ (a.k.a., the leader), and has _exclusive_ 
access to the replicated log. An elected coordinator may be _demoted_ (or lose 
exclusive access) if another coordinator broadcasts an implicit promise request 
with a higher proposal number.
+
+One question remaining is how can we find out those log entries whose values 
have not yet been agreed. We have a very simple solution: if a replica accepts 
an implicit promise request, it will include its largest known log position in 
the response. An elected coordinator will only append log entries at positions 
larger than _p_, where _p_ is greater than any log position seen in these 
responses.
+
+Multi-Paxos has better performance if the leader is stable. The replicated log 
itself does not perform leader election. Instead, we rely on the user of the 
replicated log to choose a stable leader. For example, Aurora uses 
[ZooKeeper](https://zookeeper.apache.org/) to elect the leader.
+
+### Enabling local reads
+
+As discussed above, in our implementation, each replica is both an acceptor 
and a learner. Treating each replica as a learner allows us to do local reads 
without involving other replicas. When a log entry’s value has been agreed, 
the coordinator will broadcast a _learned_ message to all replicas. Once a 
replica receives the learned message, it will set the learned bit in the 
corresponding log entry, indicating the value of that log entry has been 
agreed. We say a log entry is "learned" if its learned bit is set. The 
coordinator does not have to wait for replicas’ acknowledgments.
+
+To perform a read, the log reader will directly look up the underlying local 
replica. If the corresponding log entry is learned, the reader can just return 
the value to the user. Otherwise, a full Paxos round is needed to discover the 
agreed value. We always make sure that the replica co-located with the elected 
coordinator always has all log entries learned. We achieve that by running full 
Paxos rounds for those unlearned log entries after the coordinator is elected.
+
+### Reducing log size using garbage collection
+
+In case the log grows large, the application has the choice to truncate the 
log. To perform a truncation, we append a special log entry whose value is the 
log position to which the user wants to truncate the log. A replica can 
actually truncate the log once this special log entry has been learned.
+
+### Unique proposal number
+
+Many of the [Paxos research 
papers](https://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf)
 assume that each proposal number is globally unique, and a coordinator can 
always come up with a proposal number that is larger than any other proposal 
numbers in the system. However, implementing this is not trivial, especially in 
a distributed environment. [Some researchers 
suggest](https://ramcloud.stanford.edu/~ongaro/userstudy/paxos.pdf) 
concatenating a globally unique server id to each proposal number. But it is 
still not clear how to generate a globally unique id for each server.
+
+Our solution does not make the above assumptions. A coordinator can use an 
arbitrary proposal number initially. During the promise phase, if a replica 
knows a proposal number higher than the proposal number used by the 
coordinator, it will send the largest known proposal number back to the 
coordinator. The coordinator will retry the promise phase with a higher 
proposal number.
+
+To avoid livelock (e.g., when two coordinators completing), we inject a 
randomly delay between T and 2T before each retry. T has to be chosen 
carefully. On one hand, we want T >> broadcast time such that one coordinator 
usually times out and wins before others wake up. On the other hand, we want T 
to be as small as possible such that we can reduce the wait time. Currently, we 
use T = 100ms. This idea is actually borrowed from 
[Raft](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf).
+
+## Automatic replica recovery
+
+The algorithm described above has a critical vulnerability: if a replica loses 
its durable state (i.e., log files) due to either disk failure or operational 
error, that replica may cause inconsistency in the log if it is simply 
restarted and re-added to the group. The operator needs to stop the application 
on all hosts, copy the log files from the leader’s host, and then restart the 
application. Note that the operator cannot copy the log files from an arbitrary 
replica because copying an unlearned log entry may falsely assemble a quorum 
for an incorrect value, leading to inconsistency.
+
+To avoid the need for operator intervention in this situation, the Mesos 
replicated log includes support for _auto recovery_. As long as a quorum of 
replicas is working properly, the users of the application won’t notice any 
difference.
+
+### Non-voting replicas
+
+To enable auto recovery, a key insight is that a replica that loses its 
durable state should not be allowed to respond to requests from coordinators 
after restart. Otherwise, it may introduce inconsistency in the log as it could 
have accepted a promise/write request which it would not have accepted if its 
previous Paxos state had not been lost.
+
+To solve that, we introduce a new status variable for each replica. A normal 
replica is said in VOTING status, meaning that it is allowed to respond to 
requests from coordinators. A replica with no persisted state is put in EMPTY 
status by default. A replica in EMPTY status is not allowed to respond to any 
request from coordinators.
+
+A replica in EMPTY status will be promoted to VOTING status if the following 
two conditions are met:
+
+1. a sufficient amount of missing log entries are recovered such that if other 
replicas fail, the remaining replicas can recover all the learned log entries, 
and
+2. its future responses to a coordinator will not break any of the promises 
(potentially lost) it has given out.
+
+In the following, we discuss how we achieve these two conditions.
+
+### Catch-up
+
+To satisfy the above two conditions, a replica needs to perform _catch-up_ to 
recover lost states. In other words, it will run Paxos rounds to find out those 
log entries whose values that have already been agreed. The question is how 
many log entries the local replica should catch-up before the above two 
conditions can be satisfied.
+
+We found that it is sufficient to catch-up those log entries from position 
_begin_ to position _end_ where _begin_ is the smallest position seen in a 
quorum of VOTING replicas and _end_ is the largest position seen in a quorum of 
VOTING replicas.
+
+Here is our correctness argument. For a log entry at position _e_ where _e_ is 
larger than _end_, obviously no value has been agreed on. Otherwise, we should 
find at least one VOTING replica in a quorum of replicas such that its end 
position is larger than _end_. For the same reason, a coordinator should not 
have collected enough promises for the log entry at position _e_. Therefore, 
it's safe for the recovering replica to respond requests for that log entry. 
For a log entry at position _b_ where _b_ is smaller than _begin_, it should 
have already been truncated and the truncation should have already been agreed. 
Therefore, allowing the recovering replica to respond requests for that 
position is also safe.
+
+### Auto initialization
+
+Since we don’t allow an empty replica (a replica in EMPTY status) to respond 
to requests from coordinators, that raises a question for bootstrapping because 
initially, each replica is empty. The replicated log provides two choices here. 
One choice is to use a tool (`mesos-log) to explicitly initialize the log on 
each replica by setting the replica's status to VOTING, but that requires an 
extra step when setting up an application.
+
+The other choice is to do automatic initialization. Our idea is: we allow a 
replica in EMPTY status to become VOTING immediately if it finds all replicas 
are in EMPTY status. This is based on the assumption that the only time _all_ 
replicas are in EMPTY status is during start-up. This may not be true if a 
catastrophic failure causes all replicas to lose their durable state, and 
that's exactly the reason we allow conservative users to disable 
auto-initialization.
+
+To do auto-initialization, if we use a single-phase protocol and allow a 
replica to directly transit from EMPTY status to VOTING status, we may run into 
a state where we cannot make progress even if all replicas are in EMPTY status 
initially. For example, say the quorum size is 2. All replicas are in EMPTY 
status initially. One replica will first set its status to VOTING because if 
finds all replicas are in EMPTY status. After that, neither the VOTING replica 
nor the EMPTY replicas can make progress. To solve this problem, we use a 
two-phase protocol and introduce an intermediate transient status (STARTING) 
between EMPTY and VOTING status. A replica in EMPTY status can transit to 
STARTING status if it finds all replicas are in either EMPTY or STARTING 
status. A replica in STARTING status can transit to VOTING status if it finds 
all replicas are in either STARTING or VOTING status. In that way, in our 
previous example, all replicas will be in STARTING status before any of them 
can tr
 ansit to VOTING status.
+
+## Future work
+
+Currently, replicated log does not support dynamic quorum size change, also 
known as _reconfiguration_. Supporting reconfiguration would allow us more 
easily to add, move or swap hosts for replicas. We plan to support 
reconfiguration in the future.

Reply via email to