[jira] [Comment Edited] (CASSANDRA-6246) EPaxos

2017-08-15 Thread Igor Zubchenok (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128159#comment-16128159
 ] 

Igor Zubchenok edited comment on CASSANDRA-6246 at 8/16/17 12:45 AM:
-

I would like to try, but I'm not familiar with the Cassandra source code. :( Wouldn't 
it be easier to implement the patch again, rather than rebasing four-year-old code?

BTW, I'm looking for a solution to implement a *reference counter based on 
Cassandra*. 

My first reference-counter implementation was built on counter columns, but 
unfortunately it was ruined by the tombstone issue: once a counter gets 
back to zero, I can neither delete it nor compact it away.
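
To illustrate, this is roughly what that first attempt looked like (the schema 
here is made up for the example):

{code}
-- Hypothetical reference-counter table on a counter column
CREATE TABLE refcounts (
    id uuid PRIMARY KEY,
    refs counter
);

UPDATE refcounts SET refs = refs + 1 WHERE id = ?;  -- take a reference
UPDATE refcounts SET refs = refs - 1 WHERE id = ?;  -- release a reference

-- Once refs reaches zero, cleanup is the problem:
DELETE FROM refcounts WHERE id = ?;  -- writes a tombstone, and updating
                                     -- the counter afterwards is undefined
{code}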

My guess was that Cassandra's lightweight transactions could do a very good job 
for my task. I was naive: now I have an issue with WriteTimeoutException 
and inconsistent state.

The only workaround I came up with today is an exclusive lock, which is easily 
built with an LWT plus a TTL, followed by a separate change of the value. But 
that has a much greater performance hit. I'm still looking for a good solution 
to this with Cassandra.
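
A sketch of what I mean (the table, names, and TTL value are made up for the 
example):

{code}
-- Hypothetical lock table; the TTL bounds how long a crashed client
-- can hold the lock
CREATE TABLE locks (
    resource text PRIMARY KEY,
    owner uuid
);

-- Acquire: succeeds only if no live (non-expired) lock row exists
INSERT INTO locks (resource, owner) VALUES ('ref:42', ?)
IF NOT EXISTS USING TTL 30;

-- If [applied] = true: change the guarded value with plain writes,
-- then release the lock
DELETE FROM locks WHERE resource = 'ref:42' IF owner = ?;
{code}

That's two LWT round trips on top of the actual write for every change, which 
is where the performance hit comes from.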

Now I'm being naive again and expecting that EPaxos will help me, but it seems 
it may never be merged and released.

Dear community, do you have any idea?

P.S. Huge thanks and warm hugs to everyone who answers me!


> EPaxos
> --
>
> Key: CASSANDRA-6246
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6246
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Blake Eggleston
>  Labels: messaging-service-bump-required
> Fix For: 4.x
>
>
> One reason we haven't optimized our Paxos implementation with Multi-paxos is 
> that Multi-paxos requires leader election and hence, a period of 
> unavailability when the leader dies.
> EPaxos is a Paxos variant that (1) requires fewer messages than Multi-paxos, 
> (2) is particularly useful across multiple datacenters, and (3) allows any 
> node to act as coordinator: 
> http://sigops.org/sosp/sosp13/papers/p358-moraru.pdf
> However, there is substantial additional complexity involved if we choose to 
> implement it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-6246) EPaxos

2015-06-05 Thread Blake Eggleston (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574539#comment-14574539
 ] 

Blake Eggleston edited comment on CASSANDRA-6246 at 6/5/15 3:56 PM:


So I think this is at a point where it's ready for review.  Epaxos rebased 
against trunk is here: 
https://github.com/bdeggleston/cassandra/tree/CASSANDRA-6246-trunk. The 
supporting dtests are here: 
https://github.com/bdeggleston/cassandra-dtest/tree/CASSANDRA-6246.

Most of the problems and solutions for supporting repair, read repair, 
bootstrap, and failure recovery are discussed above, so I won't go into them here.

The upgrade from the current paxos to epaxos is opt-in, and is done via nodetool 
upgradepaxos. That needs to be run on each node after the cluster has been 
upgraded, and will transition the cluster from paxos to epaxos. Serialized 
queries are still processed during the upgrade. How that works is explained 
here: 
https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/src/java/org/apache/cassandra/service/epaxos/UpgradeService.java#L83.
 For new clusters, or nodes being added to upgrading clusters, there's a yaml 
file parameter you can set that will make the node start up with epaxos as the 
default.

Epaxos keeps the metadata for SERIAL and LOCAL_SERIAL queries completely 
separate. Keeping track of the two with the same set of metadata would have 
introduced a ton of complexity. Mixing the two is a bad idea anyway, and you 
give up the serialization guarantees when you do it, so epaxos doesn't 
even bother trying to support it.
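
For context, SERIAL versus LOCAL_SERIAL here is the serial consistency used by 
conditional statements, set client-side, e.g. in cqlsh:

{code}
SERIAL CONSISTENCY SERIAL;        -- linearizable across the whole cluster
SERIAL CONSISTENCY LOCAL_SERIAL;  -- linearizable within the local DC only
UPDATE t SET v = 1 WHERE k = 0 IF v = 0;
{code}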

There is some weird stuff going on with how some things are serialized. First, 
for cas requests, the statement objects aren't serialized when attached to 
instances... the query strings and their parameters are. This is because 
serializing the important parts of a CQLCasRequest would have required 
serializers for dozens of classes that I'm not familiar with, and that didn't 
seem to be intended for serialization. Serializing the query string and 
parameter bytes is a lot less elegant, but a lot more reliable. Hopefully 8099 
will make it a bit easier to do this correctly. Second, most of the epaxos 
metadata is persisted as blobs. This is mainly because the dependency graph 
metadata is very queue-like in its usage patterns, so dumping it as a big blob 
when there are changes prevents a lot of headaches. For the instances, since 
there are three types, each with different attributes, it seemed less risky to 
maintain a single serializer versus a serializer plus select/insert statements. 
At this point, it would probably be ok to break it out into a proper table, but 
I don't have strong feelings about it either way. The token metadata was saved 
as blobs just because the other two were, IIRC.
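
To give an idea of the shape of that, the persistence is roughly like this (a 
simplified, hypothetical sketch, not the actual schema from the branch):

{code}
-- Hypothetical illustration of blob-based instance persistence: one
-- versioned serializer covers all three instance types, so the table
-- itself stays trivial
CREATE TABLE system.epaxos_instance (
    id uuid PRIMARY KEY,
    data blob
);
{code}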

Regarding performance, I've run some tests today, this time with one and two 
datacenters and with more concurrent clients. For a single datacenter, the 
epaxos median response time is 40-50% faster than regular paxos. However, the 
95th and 99th percentiles are actually worse. I'm not sure why that is, but I 
will be looking into it in the next week or so. With multiple datacenters, 
epaxos is 70-75% faster than regular paxos, and the 95th and 99th percentiles 
are 50-70% faster as well. I haven't tested contended performance yet, and will 
do those tests in the next week or so. I'd expect them to be similar to last 
time though.

The patch is about 50% tests. I've tried to be very thorough. The core epaxos 
services wrap calls to singletons in protected methods to make testing the 
interaction of multiple nodes in unit tests straightforward. In addition to 
dtests, there are a bunch of junit tests that put a simulated cluster in 
strange and probably rare failure conditions to test recovery, like this one: 
https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/test/unit/org/apache/cassandra/service/epaxos/integration/EpaxosIntegrationRF3Test.java#L144-144.
I've also written a test that executes thousands of queries against a simulated 
cluster in a single thread, randomly turning nodes on and off, and checking 
that each node executed instances in the same order. It's pretty ugly, and 
needs to be expanded, but has been useful in uncovering small bugs. It's here: 
https://github.com/bdeggleston/cassandra/blob/CASSANDRA-6246-trunk/test/long/org/apache/cassandra/service/epaxos/EpaxosFuzzer.java#L48-48




[jira] [Comment Edited] (CASSANDRA-6246) EPaxos

2015-02-24 Thread sankalp kohli (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335767#comment-14335767
 ] 

sankalp kohli edited comment on CASSANDRA-6246 at 2/25/15 1:24 AM:
---

When do you plan to merge the new branches back into CASSANDRA-6246? 

I had some comments based on the CASSANDRA-6246 branch which may be made moot 
by your other branches. 





[jira] [Comment Edited] (CASSANDRA-6246) EPaxos

2015-01-06 Thread Blake Eggleston (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266223#comment-14266223
 ] 

Blake Eggleston edited comment on CASSANDRA-6246 at 1/6/15 3:11 PM:


By using the existing epaxos ordering constraints. The epoch is incremented by 
an instance that takes all unacknowledged instances for the token range whose 
epoch it's incrementing as dependencies. The epoch can only be incremented once 
all previous instances have been executed. 

I pushed up some commits yesterday that add the epoch functionality, if you'd 
like to take a look: 
https://github.com/bdeggleston/cassandra/tree/CASSANDRA-6246





[jira] [Comment Edited] (CASSANDRA-6246) EPaxos

2014-09-28 Thread Blake Eggleston (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151256#comment-14151256
 ] 

Blake Eggleston edited comment on CASSANDRA-6246 at 9/28/14 11:15 PM:
--

bq. In the current implementation, we only keep the last commit per CQL 
partition. We can do the same for this as well.

Yeah I've been thinking about that some more. Just because we could keep a 
bunch of historical data doesn't mean we should. There may be situations where 
we need to keep more than one instance around though, specifically when the 
instance is part of a strongly connected component. Keeping some historical 
data would be useful for helping nodes recover from short failures where they 
miss several instances, but after a point, transmitting all the activity for 
the last hour or two would just be nuts. The other issue with relying on 
historical data for failure recovery is that you can't keep all of it, so you'd 
have dangling pointers on the older instances. 

For longer partitions, and for nodes joining the ring, transmitting our 
current dependency bookkeeping for the token ranges they're replicating, the 
corresponding instances, and the current values for those instances should be 
enough to get them going.

bq. I am also reading about epaxos recently and want to know when do you do the 
condition check in your implementation?

It would have to be when the instance is executed.


