[ 
https://issues.apache.org/jira/browse/CASSANDRA-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280680#comment-15280680
 ] 

Alex Petrov edited comment on CASSANDRA-7592 at 5/12/16 7:33 PM:
-----------------------------------------------------------------

[EDITED] 

My previous comment was off, as I ran the tests incorrectly. 
I have to note that, if I understand everything correctly, this situation is not 
possible in a 3-node cluster. With consistent range movement, all replicas have 
to be alive for the node to bootstrap; otherwise we'd get the {{A node required 
to move the data consistently is down}} exception (so it's quite tricky to even 
reproduce). This doesn't really contradict the issue description, as coordinators 
are said to be chosen randomly, though I wanted to point it out.

So for example:
{code}
for i in `seq 1 4`; do ccm node$i remove; done
ccm populate -n 3
ccm start
sleep 10
ccm node1 cqlsh < data.cql
# With node1's gossip disabled this won't work, as the joining node4 won't be
# able to bootstrap:
# ccm node1 nodetool disablegossip
# sleep 10
ccm add node4 -i 127.0.0.4 -b && ccm node4 start
{code}

This produces something like the following:
{code}
node1
INFO  [SharedPool-Worker-1] 2016-05-12 18:28:05,915 Gossiper.java:1014 - InetAddress /127.0.0.4 is now UP
INFO  [PendingRangeCalculator:1] 2016-05-12 18:28:08,923 TokenMetadata.java:209 - Adding token to endpoint map `577167728929728286` `/127.0.0.4`

node4
INFO  [main] 2016-05-12 18:28:38,448 StorageService.java:1460 - Starting Bootstrap
INFO  [main] 2016-05-12 18:28:38,453 TokenMetadata.java:209 - Adding token to endpoint map `577167728929728286` `/127.0.0.4`
INFO  [main] 2016-05-12 18:28:38,903 StorageService.java:2173 - Node /127.0.0.4 state jump to NORMAL
{code}

Note that the token for the joining endpoint lands on node1 as a pending token 
early enough, and {{StorageProxy::performWrite}} is aware of the pending tokens 
as well.
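
Just to illustrate what I mean, here is a rough sketch of the write-target 
selection, not the actual {{StorageProxy}} code; {{naturalReplicasFor}} / 
{{pendingReplicasFor}} are hypothetical stand-ins for the natural and pending 
endpoint lookups:
{code}
import java.net.InetAddress;
import java.util.HashSet;
import java.util.Set;

final class WriteTargetsSketch
{
    // Hypothetical stand-ins for the natural / pending endpoint lookups.
    static Set<InetAddress> naturalReplicasFor(long token) { return new HashSet<>(); }
    static Set<InetAddress> pendingReplicasFor(long token) { return new HashSet<>(); }

    // Write targets = natural replicas plus pending replicas, so a bootstrapping
    // node whose pending token is already in the token/endpoint map receives the
    // write as well.
    static Set<InetAddress> writeTargets(long token)
    {
        Set<InetAddress> targets = new HashSet<>(naturalReplicasFor(token));
        targets.addAll(pendingReplicasFor(token));
        return targets;
    }
}
{code}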

However, judging from the code, this situation might actually occur in a larger 
cluster, where the coordinator is chosen from nodes that are not part of the 
replica set for the streamed ranges. I'll take another look at it some time 
soon. So far I cannot see a good solution that would allow simultaneous moves. 
Given that simultaneous moves aren't allowed, the next move might have to be 
postponed (or disallowed) for {{ring_delay}}; then we can store the 
{{tokenEndpointMap}} for {{ring_delay}} on the replicas.
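
Roughly what I have in mind for the replica side, as a sketch only; all names 
here are hypothetical and {{RING_DELAY_MS}} is just a placeholder for the 
configured ring delay:
{code}
import java.net.InetAddress;
import java.util.Collections;
import java.util.Map;
import java.util.Set;

// Sketch: keep the pre-move token -> endpoints mapping around for ring_delay, so
// the old owners can still be covered until every coordinator has seen the change.
final class PreviousRingSnapshot
{
    static final long RING_DELAY_MS = 30_000; // placeholder; ring delay is configurable

    private final Map<Long, Set<InetAddress>> oldTokenEndpointMap;
    private final long takenAtMillis = System.currentTimeMillis();

    PreviousRingSnapshot(Map<Long, Set<InetAddress>> oldTokenEndpointMap)
    {
        this.oldTokenEndpointMap = oldTokenEndpointMap;
    }

    boolean stillValid()
    {
        return System.currentTimeMillis() - takenAtMillis < RING_DELAY_MS;
    }

    // Old owners of the token, or an empty set once ring_delay has passed.
    Set<InetAddress> oldReplicasFor(long token)
    {
        return stillValid()
             ? oldTokenEndpointMap.getOrDefault(token, Collections.emptySet())
             : Collections.emptySet();
    }
}
{code}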

A coordinator that hasn't learned about the ring change will then contact only 
the {{old}} nodes. A coordinator that has learned about the ring change will 
send writes to both the {{old}} and {{new}} nodes. On the replica side, the 
{{old}} replica has to gracefully serve reads and receive writes.
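
And the coordinator side of the proposal, again purely illustrative and not the 
actual {{StorageProxy}} logic:
{code}
import java.net.InetAddress;
import java.util.HashSet;
import java.util.Set;

// Sketch: while the ring_delay window after an ownership change is open, writes
// go to the union of the new and old owners, so a quorum read issued by a
// coordinator that hasn't seen the change yet still overlaps with every
// successful quorum write.
final class ProposedWriteFanoutSketch
{
    static Set<InetAddress> writeTargets(Set<InetAddress> newReplicas,
                                         Set<InetAddress> oldReplicas,
                                         boolean withinRingDelay)
    {
        Set<InetAddress> targets = new HashSet<>(newReplicas);
        if (withinRingDelay)
            targets.addAll(oldReplicas);
        return targets;
    }
}
{code}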

It might also be a good idea to use {{TokenAwarePolicy}}, which could help in 
some cases (when the partition key is actually known).
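
For reference, on the client side this looks roughly like the following, 
assuming the DataStax Java driver 3.x API (the contact point is just an 
example):
{code}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TokenAwareExample
{
    public static void main(String[] args)
    {
        // Route each request to a replica of its partition key when the key is
        // known, falling back to DC-aware round-robin otherwise.
        Cluster cluster = Cluster.builder()
                                 .addContactPoint("127.0.0.1")
                                 .withLoadBalancingPolicy(
                                     new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                                 .build();
        System.out.println(cluster.getClusterName());
        cluster.close();
    }
}
{code}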



> Ownership changes can violate consistency
> -----------------------------------------
>
>                 Key: CASSANDRA-7592
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7592
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Richard Low
>            Assignee: Alex Petrov
>
> CASSANDRA-2434 goes a long way to avoiding consistency violations when 
> growing a cluster. However, there is still a window when consistency can be 
> violated when switching ownership of a range.
> Suppose you have replication factor 3 and all reads and writes at quorum. The 
> first part of the ring looks like this:
> Z: 0
> A: 100
> B: 200
> C: 300
> Choose two random coordinators, C1 and C2. Then you bootstrap node X at token 
> 50.
> Consider the token range 0-50. Before bootstrap, this is stored on A, B, C. 
> During bootstrap, writes go to X, A, B, C (and must succeed on 3) and reads 
> choose two from A, B, C. After bootstrap, the range is on X, A, B.
> When the bootstrap completes, suppose C1 processes the ownership change at t1 
> and C2 at t4. Then the following can give an inconsistency:
> t1: C1 switches ownership.
> t2: C1 performs write, so sends write to X, A, B. A is busy and drops the 
> write, but it succeeds because X and B return.
> t3: C2 performs a read. It hasn’t done the switch and chooses A and C. 
> Neither got the write at t2 so null is returned.
> t4: C2 switches ownership.
> This could be solved by continuing writes to the old replica for some time 
> (maybe ring delay) after the ownership changes.


