[jira] [Comment Edited] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128159#comment-16128159 ] Igor Zubchenok edited comment on CASSANDRA-6246 at 8/16/17 12:45 AM:

I would like to try, but I'm not familiar with the Cassandra source code. :( Isn't it easier to implement the patch again, rather than rebasing four-year-old code?

BTW, I'm looking for a way to implement a *reference counter based on Cassandra*. My first reference counter implementation was built on counter columns, but unfortunately it was ruined by a tombstone issue: when a counter gets back to zero, I can neither delete nor compact it. My guess was that Cassandra's lightweight transactions could do a very good job for my task. I was naive, and now I have an issue with WriteTimeoutException and inconsistent state. The only workaround I came up with today is an exclusive lock, which can easily be built with LWT plus a TTL, followed by the change of the value, but it has a much greater performance hit. I'm still looking for a good solution for this with Cassandra. Currently I'm being naive again and expecting that EPaxos will help me, but it seems it will never be merged and released. Dear community, do you have any ideas? P.S. Huge thanks and warm hugs to everyone who answers me!

> EPaxos
> ------
>
>             Key: CASSANDRA-6246
>             URL: https://issues.apache.org/jira/browse/CASSANDRA-6246
>         Project: Cassandra
>      Issue Type: Improvement
>        Reporter: Jonathan Ellis
>        Assignee: Blake Eggleston
>          Labels: messaging-service-bump-required
>         Fix For: 4.x
>
> One reason we haven't optimized our Paxos implementation with Multi-paxos is
> that Multi-paxos requires leader election and hence, a period of
> unavailability when the leader dies.
> EPaxos is a Paxos variant that requires (1) fewer messages than multi-paxos,
> (2) is particularly useful across multiple datacenters, and (3) allows any
> node to act as coordinator:
> http://sigops.org/sosp/sosp13/papers/p358-moraru.pdf
> However, there is substantial additional complexity involved if we choose to
> implement it.

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
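The exclusive-lock workaround described above (an LWT `INSERT ... IF NOT EXISTS USING TTL`, so the lock expires if its holder dies) can be sketched with an in-memory model of the compare-and-set semantics. This is only an illustration of the pattern, not real driver code; the table shape, lock names, and method names are hypothetical.

```python
import time

# In-memory model of the LWT-with-TTL lock pattern: acquiring the lock models
# "INSERT INTO locks (name, owner) VALUES (?, ?) IF NOT EXISTS USING TTL ?",
# and releasing it models "DELETE FROM locks WHERE name = ? IF owner = ?".
# All names here are hypothetical, for illustration only.
class LwtLockTable:
    def __init__(self):
        self._rows = {}  # lock_name -> (owner, expiry_timestamp)

    def try_acquire(self, lock_name, owner, ttl_seconds, now=None):
        now = time.time() if now is None else now
        current = self._rows.get(lock_name)
        if current is not None and current[1] > now:
            return False  # row still live: the conditional insert is not applied
        self._rows[lock_name] = (owner, now + ttl_seconds)
        return True

    def release(self, lock_name, owner, now=None):
        now = time.time() if now is None else now
        current = self._rows.get(lock_name)
        if current is None or current[1] <= now or current[0] != owner:
            return False  # expired or held by someone else: conditional delete fails
        del self._rows[lock_name]
        return True

locks = LwtLockTable()
assert locks.try_acquire("counter:42", "node-a", ttl_seconds=30, now=0)
assert not locks.try_acquire("counter:42", "node-b", ttl_seconds=30, now=10)
assert locks.try_acquire("counter:42", "node-b", ttl_seconds=30, now=31)  # TTL expired
```

The TTL is what bounds the damage of a crashed lock holder, which is exactly why this costs more than a single LWT write: every counter change now needs at least a lock acquire, the write itself, and a release.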
[jira] [Comment Edited] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574539#comment-14574539 ] Blake Eggleston edited comment on CASSANDRA-6246 at 6/5/15 3:56 PM:

So I think this is at a point where it's ready for review. Epaxos rebased against trunk is here: https://github.com/bdeggleston/cassandra/tree/CASSANDRA-6246-trunk. The supporting dtests are here: https://github.com/bdeggleston/cassandra-dtest/tree/CASSANDRA-6246. Most of the problems and solutions for supporting repair, read repair, bootstrap, and failure recovery are discussed above, so I won't go into them here.

The upgrade from the current paxos to epaxos is opt-in, and is done via nodetool upgradepaxos. That needs to be run on each node after the cluster has been upgraded, and will transition a cluster from paxos to epaxos. Serialized queries are still processed during the upgrade. How that works is explained here: https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/src/java/org/apache/cassandra/service/epaxos/UpgradeService.java#L83. For new clusters, or nodes being added to upgrading clusters, there's a yaml parameter you can set which makes the node start up with epaxos as the default.

Epaxos keeps the metadata for SERIAL and LOCAL_SERIAL queries completely separate. Keeping track of the two with the same set of metadata would have introduced a ton of complexity. Obviously mixing the two is a bad idea, and you give up the serialization guarantees when you do it anyway, so epaxos doesn't even bother trying.

There is some weird stuff going on with how some things are serialized. First, for cas requests, the statement objects aren't serialized when attached to instances... the query strings and their parameters are. This is because serializing the important parts of a CQLCasRequest would have required serializers for dozens of classes that I'm not familiar with, and that didn't seem to be intended for serialization. Serializing the query string and parameter bytes is a lot less elegant, but a lot more reliable. Hopefully 8099 will make it a bit easier to do this correctly. Second, most of the epaxos metadata is persisted as blobs. This is mainly because the dependency graph metadata is very queue-like in its usage patterns, so dumping it as a big blob when there are changes prevents a lot of headaches. For the instances, since there are 3 types, each with different attributes, it seemed less risky to maintain a single serializer vs a serializer and select/insert statements. At this point, it would probably be ok to break it out into a proper table, but I don't have strong feelings about it either way. The token metadata was saved as blobs just because the other two were, iirc.

Regarding performance, I've run some tests today, this time with one and two datacenters and with more concurrent clients. For a single datacenter, the epaxos median response time is 40-50% faster than regular paxos. However, the 95th and 99th percentiles are actually worse. I'm not sure why that is, but will be looking into it in the next week or so. With multiple datacenters, epaxos is 70-75% faster than regular paxos, and the 95th and 99th percentiles are 50-70% faster as well. I haven't tested contended performance yet, and will do that in the next week or so. I'd expect the results to be similar to last time though.

The patch is about 50% tests. I've tried to be very thorough. The core epaxos services wrap calls to singletons in protected methods to make testing the interaction of multiple nodes in unit tests straightforward. In addition to dtests, there are a bunch of junit tests that put a simulated cluster in strange and probably rare failure conditions to test recovery, like this one: https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/test/unit/org/apache/cassandra/service/epaxos/integration/EpaxosIntegrationRF3Test.java#L144-144. I've also written a test that executes thousands of queries against a simulated cluster in a single thread, randomly turning nodes on and off, and checking that each node executed instances in the same order. It's pretty ugly, and needs to be expanded, but it has been useful in uncovering small bugs. It's here: https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/test/long/org/apache/cassandra/service/epaxos/EpaxosFuzzer.java#L48-48
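The "serialize the query string and its parameter bytes instead of the statement objects" decision above can be sketched as simple length-prefixed framing. The framing format here is purely illustrative, not Cassandra's actual serialization; the point is only that a string plus opaque byte arrays round-trips trivially, where statement objects would not.

```python
import struct

# Illustrative framing for a CAS request: 4-byte query length + UTF-8 query,
# then a 2-byte parameter count, then each parameter as 4-byte length + bytes.
# This is a hypothetical format, not the project's real wire format.
def serialize_cas_request(query, params):
    q = query.encode("utf-8")
    out = struct.pack(">I", len(q)) + q + struct.pack(">H", len(params))
    for p in params:
        out += struct.pack(">I", len(p)) + p
    return out

def deserialize_cas_request(buf):
    (qlen,) = struct.unpack_from(">I", buf, 0)
    query = buf[4:4 + qlen].decode("utf-8")
    offset = 4 + qlen
    (n,) = struct.unpack_from(">H", buf, offset)
    offset += 2
    params = []
    for _ in range(n):
        (plen,) = struct.unpack_from(">I", buf, offset)
        offset += 4
        params.append(buf[offset:offset + plen])
        offset += plen
    return query, params

blob = serialize_cas_request(
    "UPDATE refs SET count = ? WHERE id = ? IF count = ?",
    [b"\x00\x02", b"\x2a", b"\x00\x01"])
assert deserialize_cas_request(blob) == (
    "UPDATE refs SET count = ? WHERE id = ? IF count = ?",
    [b"\x00\x02", b"\x2a", b"\x00\x01"])
```

The trade-off the comment describes falls out of this shape: the receiving node must re-prepare and re-bind the query, but nothing about the statement object graph ever crosses the wire.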
[jira] [Comment Edited] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335767#comment-14335767 ] sankalp kohli edited comment on CASSANDRA-6246 at 2/25/15 1:24 AM:

When do you plan to merge the new branches back into CASSANDRA-6246? I had some comments based on CASSANDRA-6246 which could be negated by your other branches.
[jira] [Comment Edited] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266223#comment-14266223 ] Blake Eggleston edited comment on CASSANDRA-6246 at 1/6/15 3:11 PM:

By using the existing epaxos ordering constraints. Incrementing the epoch is done by an instance which takes all unacknowledged instances as dependencies for the token range it's incrementing the epoch for. The epoch can only be incremented if all previous instances have also been executed. I pushed up some commits that add the epoch functionality yesterday if you'd like to take a look: https://github.com/bdeggleston/cassandra/tree/CASSANDRA-6246
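The epoch constraint described in this comment can be sketched as a small check: an epoch-increment instance takes every unacknowledged instance for the token range as a dependency, and the epoch only advances once all of them have executed. Function and state names here are illustrative, not the actual epaxos implementation classes.

```python
# Hedged sketch of the epoch-increment rule: the increment instance depends on
# all not-yet-executed instances for its token range, so it cannot commit (and
# the epoch cannot advance) until every one of them has executed.
def try_increment_epoch(current_epoch, instances):
    """instances: dict of instance_id -> state string ('executed' or earlier)."""
    unexecuted = [i for i, state in instances.items() if state != "executed"]
    if unexecuted:
        # These become dependencies of the increment instance; wait for them.
        return current_epoch, unexecuted
    return current_epoch + 1, []

epoch, deps = try_increment_epoch(7, {"a": "executed", "b": "accepted"})
assert (epoch, deps) == (7, ["b"])
epoch, deps = try_increment_epoch(7, {"a": "executed", "b": "executed"})
assert (epoch, deps) == (8, [])
```

Because the increment itself is an ordinary instance ordered by the existing dependency graph, no extra coordination mechanism is needed to make the epoch boundary consistent across replicas.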
[jira] [Comment Edited] (CASSANDRA-6246) EPaxos
[ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151256#comment-14151256 ] Blake Eggleston edited comment on CASSANDRA-6246 at 9/28/14 11:15 PM:

bq. In the current implementation, we only keep the last commit per CQL partition. We can do the same for this as well.

Yeah, I've been thinking about that some more. Just because we could keep a bunch of historical data doesn't mean we should. There may be situations where we need to keep more than one instance around, though, specifically when the instance is part of a strongly connected component. Keeping some historical data would be useful for helping nodes recover from short failures where they miss several instances, but past a point, transmitting all the activity for the last hour or two would just be nuts. The other issue with relying on historical data for failure recovery is that you can't keep all of it, so you'd have dangling pointers on the older instances. For longer partitions, and for nodes joining the ring, if we transmitted our current dependency bookkeeping for the token ranges they're replicating, the corresponding instances, and the current values for those instances, that should be enough to get going.

bq. I am also reading about epaxos recently and want to know when you do the condition check in your implementation?

It would have to be when the instance is executed.
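The answer above ("the condition check happens when the instance is executed") can be sketched as follows: a CAS instance carries its expected value, and the check runs against the replica state at execution time, not at proposal time. The store shape and function name are hypothetical, chosen only to make the ordering point concrete.

```python
# Hedged sketch: the IF-condition of a CAS instance is evaluated at execution
# time, against whatever state earlier instances in the execution order left
# behind, rather than against the state seen when the instance was proposed.
def execute_cas_instance(store, key, expected, new_value):
    applied = store.get(key) == expected
    if applied:
        store[key] = new_value
    return applied

store = {"refcount": 1}
# Proposed when refcount was 1, and still 1 at execution time: it applies.
assert execute_cas_instance(store, "refcount", 1, 0) is True
# A second instance with the same, now-stale expectation fails its check.
assert execute_cas_instance(store, "refcount", 1, 0) is False
assert store == {"refcount": 0}
```

This is why execution order matters so much in epaxos: two instances with identical conditions can get different outcomes depending purely on which one the dependency graph orders first.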