[
https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574539#comment-14574539
]
Blake Eggleston edited comment on CASSANDRA-6246 at 6/5/15 3:56 PM:
--------------------------------------------------------------------
So I think this is at a point where it's ready for review. Epaxos rebased
against trunk is here:
https://github.com/bdeggleston/cassandra/tree/CASSANDRA-6246-trunk. The
supporting dtests are here:
https://github.com/bdeggleston/cassandra-dtest/tree/CASSANDRA-6246.
Most of the problems and solutions for supporting repair, read repair,
bootstrap, and failure recovery are discussed above, so I won't go into them here.
The upgrade from the current paxos to epaxos is opt-in, and is done via nodetool
upgradepaxos. It needs to be run on each node after the whole cluster has been
upgraded, and transitions the cluster from paxos to epaxos. Serialized
queries are still processed during the upgrade. How that works is explained
here:
https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/src/java/org/apache/cassandra/service/epaxos/UpgradeService.java#L83.
For new clusters, or nodes being added to upgrading clusters, there's a yaml
file parameter you can set that makes the node start up with epaxos as the
default.
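To make the mid-upgrade behavior concrete, here's a minimal sketch of gating
serialized queries on a per-range upgrade state. The state names, types, and
methods below are illustrative assumptions, not the actual UpgradeService API;
the real logic is in the file linked above.
{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: route each serialized query based on how far the
// owning token range has progressed through the paxos -> epaxos upgrade.
abstract class SerialQueryRouter<Q, R>
{
    enum State { CLASSIC, UPGRADING, EPAXOS }

    private final ConcurrentMap<String, State> rangeStates = new ConcurrentHashMap<>();

    void setState(String range, State state)
    {
        rangeStates.put(range, state);
    }

    R executeSerial(String range, Q query)
    {
        State state = rangeStates.getOrDefault(range, State.CLASSIC);
        // Serialized queries keep working mid-upgrade: classic paxos stays
        // authoritative for a range until that range has fully flipped.
        return state == State.EPAXOS ? executeEpaxos(query) : executeClassicPaxos(query);
    }

    abstract R executeClassicPaxos(Q query);

    abstract R executeEpaxos(Q query);
}
{code}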
Epaxos keeps the metadata for SERIAL and LOCAL_SERIAL queries completely
separate. Tracking the two with the same set of metadata would have
introduced a ton of complexity. Mixing the two is a bad idea anyway, and you
give up the serialization guarantees when you do it, so epaxos doesn't
even bother trying to reconcile them.
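A rough sketch of that separation, with illustrative names (the patch's actual
types will differ):
{code:java}
import java.util.EnumMap;
import java.util.Map;

// Every metadata lookup is keyed by scope, so SERIAL and LOCAL_SERIAL
// instances can never share a dependency graph. Illustrative only.
class ScopedMetadata<M>
{
    enum Scope { GLOBAL, LOCAL }  // SERIAL vs. LOCAL_SERIAL

    private final Map<Scope, M> byScope = new EnumMap<>(Scope.class);

    ScopedMetadata(M global, M local)
    {
        byScope.put(Scope.GLOBAL, global);
        byScope.put(Scope.LOCAL, local);
    }

    M forScope(Scope scope)
    {
        return byScope.get(scope);
    }
}
{code}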
There is some weird stuff going on with how some things are serialized. First,
for CAS requests, the statement objects aren't serialized when attached to
instances; the query strings and their parameters are. This is because
serializing the important parts of a CQLCasRequest would have required
serializers for dozens of classes that I'm not familiar with, and that didn't
seem to be intended for serialization. Serializing the query string and
parameter bytes is a lot less elegant, but a lot more reliable. Hopefully
CASSANDRA-8099 will make it a bit easier to do this correctly. Second, most of
the epaxos metadata is persisted as blobs. This is mainly because the
dependency graph metadata is very queue-like in its usage patterns, so dumping
it as a big blob when there are changes prevents a lot of headaches. For the
instances, since there are three types, each with different attributes, it
seemed less risky to maintain a single serializer than a serializer plus
select/insert statements. At this point, it would probably be ok to break it
out into a proper table, but I don't have strong feelings about it either way.
The token metadata was saved as blobs just because the other two were, IIRC.
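For the CAS request case, the idea is roughly the following; the framing and
names here are assumptions for illustration, not the patch's actual serializer:
{code:java}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Serialize the query string and its bound parameter bytes rather than the
// statement object graph. The executing node re-prepares the query string
// and binds the parameters when the instance is executed.
class SerializedCasRequest
{
    final String queryString;
    final List<ByteBuffer> parameters;

    SerializedCasRequest(String queryString, List<ByteBuffer> parameters)
    {
        this.queryString = queryString;
        this.parameters = parameters;
    }

    void serialize(DataOutput out) throws IOException
    {
        out.writeUTF(queryString);
        out.writeInt(parameters.size());
        for (ByteBuffer param : parameters)
        {
            byte[] bytes = new byte[param.remaining()];
            param.duplicate().get(bytes);  // don't disturb the buffer's position
            out.writeInt(bytes.length);
            out.write(bytes);
        }
    }

    static SerializedCasRequest deserialize(DataInput in) throws IOException
    {
        String query = in.readUTF();
        int count = in.readInt();
        List<ByteBuffer> params = new ArrayList<>(count);
        for (int i = 0; i < count; i++)
        {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            params.add(ByteBuffer.wrap(bytes));
        }
        return new SerializedCasRequest(query, params);
    }
}
{code}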
Regarding performance, I've run some tests today, this time with one and two
datacenters and with more concurrent clients. For a single datacenter, the
epaxos median response time is 40-50% faster than regular paxos. However, the
95th and 99th percentiles are actually worse. I'm not sure why that is, but
I'll be looking into it in the next week or so. With multiple datacenters,
epaxos is 70-75% faster than regular paxos, and the 95th and 99th percentiles
are 50-70% faster as well. I haven't tested contended performance yet, and will
do that soon as well. I'd expect the results to be similar to last time, though.
The patch is about 50% tests; I've tried to be very thorough. The core epaxos
services wrap calls to singletons in protected methods, which makes it
straightforward to test the interaction of multiple nodes in unit tests. In
addition to dtests, there are a bunch of JUnit tests that put a simulated
cluster into strange and probably rare failure conditions to test recovery,
like this one:
https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/test/unit/org/apache/cassandra/service/epaxos/integration/EpaxosIntegrationRF3Test.java#L144-144.
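The singleton-wrapping test seam looks roughly like this; class and method
names are illustrative, not the patch's actual ones:
{code:java}
import java.util.ArrayList;
import java.util.List;

// Production code reaches messaging through a protected method...
class EpaxosNode
{
    protected void send(String message, int toNode)
    {
        // production would go through the messaging singleton here
    }

    void preaccept(String instance, int[] replicas)
    {
        for (int replica : replicas)
            send("PREACCEPT:" + instance, replica);
    }
}

// ...so a unit test can subclass it and deliver messages to an in-memory
// cluster instead, with no real network or singletons involved.
class TestEpaxosNode extends EpaxosNode
{
    final List<String> delivered = new ArrayList<>();

    @Override
    protected void send(String message, int toNode)
    {
        delivered.add(toNode + " <- " + message);
    }
}
{code}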
I've also written a test that executes thousands of queries against a simulated
cluster in a single thread, randomly turning nodes on and off, and checking
that each node executed instances in the same order. It's pretty ugly, and
needs to be expanded, but has been useful in uncovering small bugs. It's here:
https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/test/long/org/apache/cassandra/service/epaxos/EpaxosFuzzer.java#L48-48
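The invariant the fuzzer checks boils down to something like this sketch
(illustrative only; see EpaxosFuzzer for the real check):
{code:java}
import java.util.List;

// After the run, every node's execution log must agree with every other
// node's on their common prefix; nodes that were down may simply be behind.
final class ExecutionOrderCheck
{
    static void assertSameOrder(List<List<String>> logsPerNode)
    {
        if (logsPerNode.isEmpty())
            return;
        List<String> reference = logsPerNode.get(0);
        for (List<String> log : logsPerNode)
        {
            int common = Math.min(reference.size(), log.size());
            for (int i = 0; i < common; i++)
            {
                if (!reference.get(i).equals(log.get(i)))
                    throw new AssertionError("divergent execution order at index " + i);
            }
        }
    }
}
{code}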
> EPaxos
> ------
>
> Key: CASSANDRA-6246
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6246
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Jonathan Ellis
> Assignee: Blake Eggleston
> Fix For: 3.x
>
>
> One reason we haven't optimized our Paxos implementation with Multi-paxos is
> that Multi-paxos requires leader election and hence a period of
> unavailability when the leader dies.
> EPaxos is a Paxos variant that (1) requires fewer messages than Multi-paxos,
> (2) is particularly useful across multiple datacenters, and (3) allows any
> node to act as coordinator:
> http://sigops.org/sosp/sosp13/papers/p358-moraru.pdf
> However, there is substantial additional complexity involved if we choose to
> implement it.