[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230331#comment-14230331 ]
Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------
Since it looks like the performance improvements from epaxos could be worth the
(substantial) added complexity, I’ve been thinking through the problems caused
by the need to garbage collect instances, and by repair causing inconsistencies
when it sends data from ‘the future’.
For repair, the only thing I’ve thought of that would work 100% of the time
would be to count executed instances for a partition, and to send that count
along with the repair request. If the remote count is higher than the local
count, we know for sure that the remote node has data from the future, and the
repair for that partition should be deferred.
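A minimal sketch of that check, with entirely hypothetical names (nothing here
is an existing Cassandra class), just to illustrate the comparison:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: defer repair for a partition when the remote
    // node has executed more instances than we have, i.e. its data is
    // from 'the future' relative to our local epaxos state.
    public class RepairFutureCheck
    {
        // executed-instance count per partition, maintained locally
        private final Map<String, Long> executionCounts = new ConcurrentHashMap<>();

        public void recordExecution(String partitionKey)
        {
            executionCounts.merge(partitionKey, 1L, Long::sum);
        }

        // remoteCount is the count sent along with the repair request
        public boolean shouldDefer(String partitionKey, long remoteCount)
        {
            long localCount = executionCounts.getOrDefault(partitionKey, 0L);
            return remoteCount > localCount;
        }
    }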
For garbage collection, we’ll need to support a failure recovery mode that
works without all historical instances. We also need a way to quickly determine
whether a prepare phase is sufficient, or whether an epaxos repair-type
operation is needed to bring a node up to speed.
Breaking the continuous execution space of partition ranges into discrete
epochs would give us a relatively straightforward way of solving all of these
problems. Each partition range will have its own epoch number. At a given
instance-count threshold, time threshold, or event, epaxos will run an epoch
increment instance, which takes every active instance in its partition range
as a dependency. Any instance executed before the epoch instance belongs to
the old epoch; any executed after it belongs to the new one.
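As a rough illustration of those mechanics, under assumed names and an assumed
instance-count trigger:

    import java.util.Set;
    import java.util.UUID;

    // Hypothetical sketch of per-partition-range epoch state. The epoch
    // increment is run as a normal epaxos instance whose dependencies are
    // every active instance in the range, so its execution point cleanly
    // separates the old epoch from the new one.
    public class RangeEpoch
    {
        private long epoch = 0;
        private long executedThisEpoch = 0;
        private final long threshold; // assumed instance-count trigger

        public RangeEpoch(long threshold)
        {
            this.threshold = threshold;
        }

        // called after each instance in this range executes
        public void onExecuted(Set<UUID> activeInstances)
        {
            if (++executedThisEpoch < threshold)
                return;

            proposeEpochIncrement(epoch + 1, activeInstances);
            epoch++;
            executedThisEpoch = 0;
        }

        private void proposeEpochIncrement(long newEpoch, Set<UUID> deps)
        {
            // stand-in: preaccept/accept/commit the epoch instance through
            // the normal epaxos path, with deps as its dependency set
        }
    }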
How this would solve the outstanding problems:
Garbage Collection: Any instance from 2 or more epochs ago can be deleted.
Although an epoch increment instance doesn’t prevent dependencies on the
previous epoch, it does prevent dependencies on anything from 2 or more epochs
back.
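A one-method sketch of that rule (hypothetical names again):

    import java.util.Map;
    import java.util.UUID;

    // Hypothetical sketch of the GC rule: an instance is only safe to
    // delete once it is two or more epochs behind its range's current
    // epoch, since nothing in the current epoch can still depend on it.
    public class InstanceGc
    {
        // instanceEpoch maps instance id -> epoch it executed in
        public static void gc(Map<UUID, Long> instanceEpoch, long currentEpoch)
        {
            instanceEpoch.values().removeIf(epoch -> epoch <= currentEpoch - 2);
        }
    }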
Repair: Counting executions allows us to determine if repair data is from the
future, and epochs let us scope execution counts to an epoch. If the epoch has
incremented twice without new executions for a partition, the bookkeeping data
for that partition can be deleted. This gives us a race-free way to delete old
execution counts, so we don’t keep bookkeeping data around forever.
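A sketch of what that bookkeeping might look like, again with made-up names:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: execution counts scoped to an epoch. Once a
    // partition's last execution is two or more epochs behind the current
    // epoch, its entry can be pruned without racing new executions.
    public class ExecutionBookkeeping
    {
        // partition key -> { epoch of last execution, count in that epoch }
        private final Map<String, long[]> counts = new ConcurrentHashMap<>();

        public void recordExecution(String key, long currentEpoch)
        {
            counts.compute(key, (k, v) -> {
                if (v == null || v[0] != currentEpoch)
                    return new long[]{ currentEpoch, 1 };
                v[1]++;
                return v;
            });
        }

        public void pruneStale(long currentEpoch)
        {
            counts.values().removeIf(v -> v[0] <= currentEpoch - 2);
        }
    }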
Failure recovery: Using epochs makes the choice between a prepare phase and
failure recovery unambiguous. If a node is missing instances from 2 or more
epochs ago, it will need to run a failure recovery; otherwise, prepare phases
will work. Additionally, using an epaxos instance as the method of incrementing
epochs guarantees that a given instance has been executed once the epoch has
been incremented twice.
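The decision itself reduces to an epoch comparison; a sketch, assuming we can
track the oldest epoch a node is missing instances from:

    // Hypothetical sketch of the prepare-vs-recovery decision: once the
    // missing instances are two or more epochs old, peers may already have
    // GC'd them, so a prepare phase can't succeed and a full repair-style
    // recovery is required.
    public class RecoveryDecision
    {
        public enum Mode { PREPARE, FULL_RECOVERY }

        public static Mode choose(long oldestMissingEpoch, long currentEpoch)
        {
            return currentEpoch - oldestMissingEpoch >= 2
                 ? Mode.FULL_RECOVERY
                 : Mode.PREPARE;
        }
    }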
> EPaxos
> ------
>
> Key: CASSANDRA-6246
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6246
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Jonathan Ellis
> Assignee: Blake Eggleston
> Priority: Minor
>
> One reason we haven't optimized our Paxos implementation with Multi-paxos is
> that Multi-paxos requires leader election and hence, a period of
> unavailability when the leader dies.
> EPaxos is a Paxos variant that (1) requires fewer messages than multi-paxos,
> (2) is particularly useful across multiple datacenters, and (3) allows any
> node to act as coordinator:
> http://sigops.org/sosp/sosp13/papers/p358-moraru.pdf
> However, there is substantial additional complexity involved if we choose to
> implement it.