[
https://issues.apache.org/jira/browse/CASSANDRA-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280680#comment-15280680
]
Alex Petrov edited comment on CASSANDRA-7592 at 5/12/16 7:33 PM:
-----------------------------------------------------------------
[EDITED]
My previous comment was off, as I ran the tests incorrectly.
I have to note that in a 3-node cluster such a situation is not possible, if I
understand everything correctly. In order to bootstrap a node with consistent
range movement, all replicas have to be alive, otherwise we'd get an {{A node
required to move the data consistently is down}} exception (so it's quite
tricky to even reproduce). This doesn't really contradict the issue description,
as coordinators are said to be chosen randomly, though I wanted to point it out.
So for example:
{code}
# remove any leftover nodes from a previous run
for i in `seq 1 4`; do ccm node$i remove; done
# create and start a 3-node cluster
ccm populate -n 3
ccm start
sleep 10
# load the test data through node1
ccm node1 cqlsh < data.cql
# With node1 gossip disabled this won't work, as the joining node4
# won't be able to bootstrap:
# ccm node1 nodetool disablegossip
# sleep 10
# bootstrap a fourth node
ccm add node4 -i 127.0.0.4 -b && ccm node4 start
{code}
This would produce something like the following:
{code}
node1:
INFO [SharedPool-Worker-1] 2016-05-12 18:28:05,915 Gossiper.java:1014 - InetAddress /127.0.0.4 is now UP
INFO [PendingRangeCalculator:1] 2016-05-12 18:28:08,923 TokenMetadata.java:209 - Adding token to endpoint map `577167728929728286` `/127.0.0.4`

node4:
INFO [main] 2016-05-12 18:28:38,448 StorageService.java:1460 - Starting Bootstrap
INFO [main] 2016-05-12 18:28:38,453 TokenMetadata.java:209 - Adding token to endpoint map `577167728929728286` `/127.0.0.4`
INFO [main] 2016-05-12 18:28:38,903 StorageService.java:2173 - Node /127.0.0.4 state jump to NORMAL
{code}
Note here that the token for the endpoint lands on node1 as a pending token
early enough, and {{StorageProxy::performWrite}} would be aware of the pending
tokens as well.
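As a rough illustration of why that matters (this is a simplified, self-contained model, not the actual {{StorageProxy}} code; the {{Ring}} interface and its methods are made up for this sketch), the coordinator-side idea is that write targets are the union of the natural and the pending endpoints for the written token, so a bootstrapping node receives writes before it formally owns the range:
{code}
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified model of coordinator-side write routing during a bootstrap.
// Ring, naturalEndpointsFor and pendingEndpointsFor are hypothetical
// stand-ins for the TokenMetadata / replication strategy lookups in Cassandra.
public class PendingWriteTargets
{
    public interface Ring
    {
        List<String> naturalEndpointsFor(long token);   // current owners of the range
        List<String> pendingEndpointsFor(long token);   // nodes bootstrapping into it
    }

    // The coordinator writes to natural + pending endpoints, otherwise the
    // newly bootstrapped node could miss writes accepted while it was streaming.
    public static Set<String> writeTargets(Ring ring, long token)
    {
        Set<String> targets = new LinkedHashSet<>(ring.naturalEndpointsFor(token));
        targets.addAll(ring.pendingEndpointsFor(token));
        return targets;
    }

    public static void main(String[] args)
    {
        Ring ring = new Ring()
        {
            public List<String> naturalEndpointsFor(long token) { return List.of("127.0.0.1", "127.0.0.2", "127.0.0.3"); }
            public List<String> pendingEndpointsFor(long token) { return List.of("127.0.0.4"); }
        };
        // For the token from the log above, writes go to all four nodes.
        System.out.println(writeTargets(ring, 577167728929728286L));
    }
}
{code}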
However, judging from the code, this situation might actually occur in a larger
cluster, where the coordinator is chosen from nodes that are not part of the
replica set for the streamed ranges. I'll take another try at it some time soon.
Although so far I cannot see a good solution that would allow simultaneous
moves. Given that simultaneous moves aren't allowed, the next move might have to
be postponed for (or disallowed until) {{ring_delay}}; then we can store the
{{tokenEndpointMap}} for the duration of {{ring_delay}} on the replicas.
A coordinator that hasn't learned about the ring change will then contact the
{{old}} nodes only. A coordinator that has learned about the ring change will
send writes to both the {{old}} and {{new}} nodes. On the replica side, the
{{old}} replica has to gracefully serve reads and receive writes.
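A rough, non-authoritative sketch of that proposal (all names here are hypothetical, not existing Cassandra classes): replicas keep the pre-move owner set around for {{ring_delay}}, and a coordinator's write targets depend on whether it has seen the change yet:
{code}
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the proposal: after an ownership change, keep the
// old owners around for ring_delay so writes still reach them and they can
// keep serving reads until every coordinator has seen the change.
public class RingDelayOwnership
{
    private final List<String> oldOwners;   // owners before the move
    private final List<String> newOwners;   // owners after the move
    private final long changeAtMillis;      // when the move completed
    private final long ringDelayMillis;     // e.g. 30_000

    public RingDelayOwnership(List<String> oldOwners, List<String> newOwners,
                              long changeAtMillis, long ringDelayMillis)
    {
        this.oldOwners = oldOwners;
        this.newOwners = newOwners;
        this.changeAtMillis = changeAtMillis;
        this.ringDelayMillis = ringDelayMillis;
    }

    // A coordinator that has NOT seen the change keeps writing to the old owners only.
    // A coordinator that HAS seen it writes to old + new until ring_delay has elapsed,
    // after which the old set can be dropped.
    public Set<String> writeTargets(boolean coordinatorSawChange, long nowMillis)
    {
        if (!coordinatorSawChange)
            return new LinkedHashSet<>(oldOwners);

        Set<String> targets = new LinkedHashSet<>(newOwners);
        if (nowMillis - changeAtMillis < ringDelayMillis)
            targets.addAll(oldOwners);
        return targets;
    }
}
{code}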
It might be a good idea to use {{TokenAwarePolicy}}, which might help in some
cases (when the partition key is actually known).
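For reference, this is how a client could enable it, assuming the DataStax Java driver 3.x (the contact point and the query are placeholders):
{code}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TokenAwareClient
{
    public static void main(String[] args)
    {
        // Route each request to a replica of the partition being read/written,
        // wrapping the usual DC-aware round-robin policy.
        Cluster cluster = Cluster.builder()
                                 .addContactPoint("127.0.0.1")
                                 .withLoadBalancingPolicy(new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                                 .build();
        try
        {
            Session session = cluster.connect();
            session.execute("SELECT release_version FROM system.local");
        }
        finally
        {
            cluster.close();
        }
    }
}
{code}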
> Ownership changes can violate consistency
> -----------------------------------------
>
> Key: CASSANDRA-7592
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7592
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Richard Low
> Assignee: Alex Petrov
>
> CASSANDRA-2434 goes a long way to avoiding consistency violations when
> growing a cluster. However, there is still a window when consistency can be
> violated when switching ownership of a range.
> Suppose you have replication factor 3 and all reads and writes at quorum. The
> first part of the ring looks like this:
> Z: 0
> A: 100
> B: 200
> C: 300
> Choose two random coordinators, C1 and C2. Then you bootstrap node X at token
> 50.
> Consider the token range 0-50. Before bootstrap, this is stored on A, B, C.
> During bootstrap, writes go to X, A, B, C (and must succeed on 3) and reads
> choose two from A, B, C. After bootstrap, the range is on X, A, B.
> When the bootstrap completes, suppose C1 processes the ownership change at t1
> and C2 at t4. Then the following can give an inconsistency:
> t1: C1 switches ownership.
> t2: C1 performs write, so sends write to X, A, B. A is busy and drops the
> write, but it succeeds because X and B return.
> t3: C2 performs a read. It hasn’t done the switch and chooses A and C.
> Neither got the write at t2 so null is returned.
> t4: C2 switches ownership.
> This could be solved by continuing writes to the old replica for some time
> (maybe ring delay) after the ownership changes.