[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574539#comment-14574539 ]
Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

So I think this is at a point where it's ready for review. Epaxos rebased against trunk is here: https://github.com/bdeggleston/cassandra/tree/CASSANDRA-6246-trunk. The supporting dtests are here: https://github.com/bdeggleston/cassandra-dtest/tree/CASSANDRA-6246.

Most of the problems and solutions for supporting repair, read repair, bootstrap, and failure recovery are discussed above, so I won't go into them here.

The upgrade from the current paxos to epaxos is opt-in, and is done via nodetool upgradepaxos. That needs to be run on each node after the cluster has been upgraded, and will transition the cluster from paxos to epaxos. Serialized queries are still processed during the upgrade; how that works is explained here: https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/src/java/org/apache/cassandra/service/epaxos/UpgradeService.java#L83. For new clusters, or nodes being added to upgrading clusters, there's a yaml parameter you can set which will make the cluster start out on epaxos.

Epaxos keeps the metadata for SERIAL and LOCAL_SERIAL queries completely separate. Keeping track of the two with the same set of metadata would have introduced a ton of complexity. Mixing the two is obviously a bad idea, and you give up the serialization guarantees when you do it anyway, so epaxos doesn't even bother trying.

There is some weird stuff going on with how some things are serialized. First, for CAS requests, the statement objects aren't serialized when attached to instances; the query strings and their parameters are. This is because serializing the important parts of a CQLCasRequest would have required serializers for dozens of classes that I'm not familiar with, and that didn't seem to be intended for serialization. Serializing the query string and parameter bytes is a lot less elegant, but a lot more reliable. Hopefully 8099 will make it a bit easier to do this correctly.
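To make the serialization trade-off concrete, here is a minimal sketch of the "query string plus parameter bytes" approach, not code from the patch; the class and method names are hypothetical, and the real implementation lives in the epaxos instance serializers:

```java
import java.io.*;
import java.nio.ByteBuffer;
import java.util.*;

// Hypothetical sketch: persist a CAS request as its CQL string plus the raw
// bound-parameter bytes, instead of serializing the statement object graph.
public class QuerySerializer
{
    public static byte[] serialize(String query, List<ByteBuffer> params) throws IOException
    {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeUTF(query);                 // the CQL string itself
        out.writeInt(params.size());
        for (ByteBuffer param : params)
        {
            byte[] b = new byte[param.remaining()];
            param.duplicate().get(b);        // copy without disturbing the buffer's position
            out.writeInt(b.length);          // length-prefixed parameter bytes
            out.write(b);
        }
        return bytes.toByteArray();
    }

    public static Map.Entry<String, List<ByteBuffer>> deserialize(byte[] data) throws IOException
    {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        String query = in.readUTF();
        int n = in.readInt();
        List<ByteBuffer> params = new ArrayList<>(n);
        for (int i = 0; i < n; i++)
        {
            byte[] b = new byte[in.readInt()];
            in.readFully(b);
            params.add(ByteBuffer.wrap(b));
        }
        return new AbstractMap.SimpleEntry<>(query, params);
    }
}
```

The receiving node re-prepares the query string and re-binds the parameters, so nothing about the statement's internal class hierarchy ever crosses the wire.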
Second, most of the epaxos metadata is persisted as blobs. This is mainly because the dependency graph metadata is very queue-like in its usage patterns, so dumping it as a big blob when there are changes prevents a lot of headaches. For the instances, since there are three types, each with different attributes, it seemed less risky to maintain a single serializer vs a serializer per type plus select/insert statements. At this point, it would probably be ok to break it out into a proper table, but I don't have strong feelings about it either way. The token metadata was saved as blobs just because the other two were, iirc.

Regarding performance, I've run some tests today, this time with one and two datacenters and with more concurrent clients. For a single datacenter, the epaxos median response time is 40-50% faster than regular paxos. However, the 95th and 99th percentiles are actually worse. I'm not sure why that is, but will be looking into it in the next week or so. With multiple datacenters, epaxos is 70-75% faster than regular paxos, and the 95th & 99th percentiles are 50-70% faster as well. I haven't tested contended performance yet, and will do that in the next week or so. I'd expect the results to be similar to last time though.

The patch is about 50% tests. I've tried to be very thorough. The core epaxos services wrap calls to singletons in protected methods to make testing the interaction of multiple nodes in unit tests straightforward. In addition to dtests, there are a bunch of junit tests that put a simulated cluster into strange and probably rare failure conditions to test recovery, like this one: https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/test/unit/org/apache/cassandra/service/epaxos/integration/EpaxosIntegrationRF3Test.java#L144-144. I've also written a test that executes thousands of queries against a simulated cluster in a single thread, randomly turning nodes on and off, and checking that each node executed instances in the same order.
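The singleton-wrapping test seam mentioned above can be illustrated with a small sketch; this is not the patch's code, and the class and method names are hypothetical:

```java
// Hypothetical sketch of the test-seam pattern: the service routes singleton
// access through a protected method, so unit tests can override it and run
// several "nodes" in one process without touching the real messaging layer.
public class EpaxosService
{
    // Production path: would delegate to the process-wide messaging singleton.
    protected void sendReply(String message, int toNode)
    {
        // e.g. MessagingService.instance().sendReply(...) in the real code base
        System.out.println("sending to node " + toNode + ": " + message);
    }

    public void handlePreaccept(String instanceId, int fromNode)
    {
        // ... update local state for instanceId ...
        sendReply("PREACCEPT_RESPONSE:" + instanceId, fromNode);
    }
}

// In a unit test, a subclass records the messages instead of sending them:
class RecordingService extends EpaxosService
{
    final java.util.List<String> sent = new java.util.ArrayList<>();

    @Override
    protected void sendReply(String message, int toNode)
    {
        sent.add(toNode + ":" + message);   // capture instead of hitting the network
    }
}
```

With several such recording instances wired together, a test can deliver, delay, or drop messages deterministically and assert on each node's state, which is what makes the simulated-cluster junit tests possible.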
It's pretty ugly, and needs to be expanded, but it has been useful in uncovering small bugs. It's here: https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/test/long/org/apache/cassandra/service/epaxos/EpaxosFuzzer.java#L48-48

> EPaxos
> ------
>
>                 Key: CASSANDRA-6246
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6246
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Blake Eggleston
>            Priority: Minor
>
> One reason we haven't optimized our Paxos implementation with Multi-paxos is
> that Multi-paxos requires leader election and hence, a period of
> unavailability when the leader dies.
>
> EPaxos is a Paxos variant that (1) requires fewer messages than multi-paxos,
> (2) is particularly useful across multiple datacenters, and (3) allows any
> node to act as coordinator:
> http://sigops.org/sosp/sosp13/papers/p358-moraru.pdf
>
> However, there is substantial additional complexity involved if we choose to
> implement it.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)