[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574539#comment-14574539 ]

Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

So I think this is at a point where it's ready for review.  Epaxos rebased 
against trunk is here: 
https://github.com/bdeggleston/cassandra/tree/CASSANDRA-6246-trunk. The 
supporting dtests are here: 
https://github.com/bdeggleston/cassandra-dtest/tree/CASSANDRA-6246.

Most of the problems and solutions for supporting repair, read repair, 
bootstrap, and failure recovery have been discussed above, so I won't go into 
them here.

The upgrade from the current paxos to epaxos is opt-in, and is done via 
nodetool upgradepaxos. That needs to be run on each node after the cluster has 
been upgraded, and will transition the cluster from paxos to epaxos. 
Serialized queries are still processed during the upgrade. How that works is 
explained here: 
https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/src/java/org/apache/cassandra/service/epaxos/UpgradeService.java#L83.
 For new clusters, or for nodes being added to upgrading clusters, there's a 
yaml file parameter you can set which will make the cluster start on epaxos.
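As a sketch of the opt-in path for a fresh node, something like the fragment below could go in cassandra.yaml. Note that the comment above doesn't name the actual parameter, so the name here is purely hypothetical; check UpgradeService for the real one.

```yaml
# HYPOTHETICAL cassandra.yaml fragment: the real parameter name isn't
# given in this comment. Setting it on a new cluster (or on a node joining
# a cluster that is upgrading) makes it start directly on epaxos, so the
# per-node `nodetool upgradepaxos` step is unnecessary.
paxos_variant: epaxos
```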

Epaxos keeps the metadata for SERIAL and LOCAL_SERIAL queries completely 
separate. Tracking both with the same set of metadata would have introduced a 
ton of complexity. Mixing the two is a bad idea anyway, since you give up the 
serialization guarantees when you do it, so epaxos doesn't even bother trying 
to support it.
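As a rough sketch of that separation (class and method names here are illustrative, not the actual epaxos code), each scope owns its own store and there is no shared view, so SERIAL and LOCAL_SERIAL instances can never land in the same dependency graph:

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: per-scope metadata stores with no shared state.
class ScopedMetadata {
    enum Scope { GLOBAL, LOCAL }   // SERIAL vs LOCAL_SERIAL

    static class KeyState {
        long lastExecuted = -1;    // placeholder for per-key epaxos state
    }

    // One fully independent map per scope.
    private final Map<Scope, Map<String, KeyState>> stores = new EnumMap<>(Scope.class);

    ScopedMetadata() {
        for (Scope s : Scope.values())
            stores.put(s, new HashMap<>());
    }

    // Every lookup goes through a scope; there is no combined view,
    // so the two consistency levels can't contaminate each other.
    KeyState get(Scope scope, String key) {
        return stores.get(scope).computeIfAbsent(key, k -> new KeyState());
    }
}
```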

There is some weird stuff going on with how some things are serialized. First, 
for CAS requests, the statement objects aren't serialized when attached to 
instances; the query strings and their parameters are. This is because 
serializing the important parts of a CQLCasRequest would have required 
serializers for dozens of classes that I'm not familiar with, and that don't 
seem to be intended for serialization. Serializing the query string and 
parameter bytes is a lot less elegant, but a lot more reliable. Hopefully 8099 
will make it a bit easier to do this correctly. Second, most of the epaxos 
metadata is persisted as blobs. This is mainly because the dependency graph 
metadata is very queue-like in its usage patterns, so dumping it as a big blob 
when there are changes prevents a lot of headaches. For the instances, since 
there are 3 types, each with different attributes, it seemed less risky to 
maintain a single serializer than a serializer plus select/insert statements. 
At this point, it would probably be OK to break it out into a proper table, 
but I don't have strong feelings about it either way. The token metadata was 
saved as blobs just because the other two were, IIRC.
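A minimal sketch of the query-string-plus-parameter-bytes approach (the class and field names are mine, not the actual epaxos code): instead of serializing the statement object graph, only the raw CQL text and the already-serialized bound values travel with the instance.

```java
import java.io.*;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: ship the CQL text and bound values, not the
// statement objects, so no serializers are needed for statement internals.
class SerializedCasRequest {
    final String query;              // raw CQL text
    final List<ByteBuffer> values;   // already-serialized bound parameters

    SerializedCasRequest(String query, List<ByteBuffer> values) {
        this.query = query;
        this.values = values;
    }

    // Query text plus each parameter as a length-prefixed blob.
    void serialize(DataOutput out) throws IOException {
        out.writeUTF(query);
        out.writeInt(values.size());
        for (ByteBuffer v : values) {
            out.writeInt(v.remaining());
            byte[] b = new byte[v.remaining()];
            v.duplicate().get(b);   // duplicate() so the buffer isn't consumed
            out.write(b);
        }
    }

    static SerializedCasRequest deserialize(DataInput in) throws IOException {
        String query = in.readUTF();
        int n = in.readInt();
        List<ByteBuffer> values = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            byte[] b = new byte[in.readInt()];
            in.readFully(b);
            values.add(ByteBuffer.wrap(b));
        }
        return new SerializedCasRequest(query, values);
    }
}
```

The receiving node would re-prepare the query string and bind the values, at the cost of a prepare per execution; that's the "less elegant, but more reliable" trade-off.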

Regarding performance, I've run some tests today, this time with one and two 
datacenters and with more concurrent clients. For a single datacenter, the 
epaxos median response time is 40-50% faster than regular paxos. However, the 
95th and 99th percentiles are actually worse; I'm not sure why that is, but 
I'll be looking into it in the next week or so. With multiple datacenters, 
epaxos is 70-75% faster than regular paxos, and the 95th & 99th percentiles 
are 50-70% faster as well. I haven't tested contended performance yet, but 
will do that soon, and I'd expect the results to be similar to last time.

The patch is about 50% tests. I've tried to be very thorough. The core epaxos 
services wrap calls to singletons in protected methods to make testing the 
interaction of multiple nodes in unit tests straightforward. In addition to 
dtests, there are a bunch of junit tests that put a simulated cluster in 
strange and probably rare failure conditions to test recovery, like this one: 
https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/test/unit/org/apache/cassandra/service/epaxos/integration/EpaxosIntegrationRF3Test.java#L144-144.
I've also written a test that executes thousands of queries against a simulated 
cluster in a single thread, randomly turning nodes on and off, and checking 
that each node executed instances in the same order. It's pretty ugly, and 
needs to be expanded, but has been useful in uncovering small bugs. It's here: 
https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/test/long/org/apache/cassandra/service/epaxos/EpaxosFuzzer.java#L48-48
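The protected-method seam mentioned above looks roughly like this (class and method names are illustrative, not the actual epaxos code): the service routes its singleton calls through a protected method, so a unit test can subclass it and capture the traffic without touching real singletons or the network.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for a process-wide singleton with real I/O.
class MessagingService {
    private static final MessagingService INSTANCE = new MessagingService();
    static MessagingService instance() { return INSTANCE; }
    void send(String message, int toNode) { /* real network I/O */ }
}

class EpaxosService {
    // Production path: delegate to the singleton. Protected so tests can
    // override it and intercept outgoing messages.
    protected void sendReply(String message, int toNode) {
        MessagingService.instance().send(message, toNode);
    }

    void handlePreaccept(String message, int fromNode) {
        // ... protocol logic would run here ...
        sendReply("PREACCEPT_RESPONSE", fromNode);
    }
}

// In a unit test, override the seam to record messages instead of sending
// them; several such instances can then play the roles of cluster nodes.
class RecordingService extends EpaxosService {
    final List<String> sent = new ArrayList<>();
    @Override protected void sendReply(String message, int toNode) {
        sent.add(message + "->" + toNode);   // no singleton, no network
    }
}
```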


> EPaxos
> ------
>
>                 Key: CASSANDRA-6246
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6246
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Blake Eggleston
>            Priority: Minor
>
> One reason we haven't optimized our Paxos implementation with Multi-paxos is 
> that Multi-paxos requires leader election and hence, a period of 
> unavailability when the leader dies.
> EPaxos is a Paxos variant that (1) requires fewer messages than Multi-paxos, 
> (2) is particularly useful across multiple datacenters, and (3) allows any 
> node to act as coordinator: 
> http://sigops.org/sosp/sosp13/papers/p358-moraru.pdf
> However, there is substantial additional complexity involved if we choose to 
> implement it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
