Richard Low created CASSANDRA-10887:
---------------------------------------
Summary: Pending range calculator gives wrong pending ranges for moves
Key: CASSANDRA-10887
URL: https://issues.apache.org/jira/browse/CASSANDRA-10887
Project: Cassandra
Issue Type: Bug
Components: Coordination
Reporter: Richard Low
Priority: Critical
My understanding is that the PendingRangeCalculator is meant to calculate which
nodes should receive extra writes during range movements. However, it adds the
wrong ranges for moves. An extreme example of this can be seen in the following
reproduction. Create a 5 node cluster (I did this on 2.0.16 and 2.2.4), a
keyspace with RF=3, and a simple table. Then start moving a node and
immediately kill it with kill -9. The ring now shows that node as down and
moving. Try a quorum write for a partition that is stored on that node: it will
fail with a timeout. Further, all CAS reads and writes fail immediately with an
unavailable exception because they attempt to include the moving node twice.
This is likely to be the cause of CASSANDRA-10423.
In my example I had this ring:
Address    Rack   Status  State   Load       Owns    Token
127.0.0.1  rack1  Up      Normal  170.97 KB  20.00%  -9223372036854775808
127.0.0.2  rack1  Up      Normal  124.06 KB  20.00%  -5534023222112865485
127.0.0.3  rack1  Down    Moving  108.7 KB   40.00%  1844674407370955160
127.0.0.4  rack1  Up      Normal  142.58 KB  0.00%   1844674407370955161
127.0.0.5  rack1  Up      Normal  118.64 KB  20.00%  5534023222112865484
Node 3 was moving to token -1844674407370955160. I added logging to print the
pending and natural endpoints. For ranges owned by node 3, node 3 appeared in
both the pending and the natural endpoints. The blockFor is increased to 3, so
we’re effectively doing CL.ALL operations. This manifests as write timeouts and
CAS unavailables when the node is down.
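
To make the arithmetic concrete, here is a rough Python sketch (not Cassandra
code). It assumes blockFor for a quorum write is quorum(RF) plus the number of
pending endpoints, that a node listed as both natural and pending can only
acknowledge once, and that the natural replicas for the range are nodes 3, 4
and 5. The names quorum and check_write and the labels n1..n5 are made up for
illustration.

```python
def quorum(rf):
    # Quorum size for a given replication factor.
    return rf // 2 + 1

def check_write(natural, pending, live, rf=3):
    # Assumption: required acks = quorum + number of pending endpoints.
    block_for = quorum(rf) + len(pending)
    # A node that is both natural and pending is still a single target.
    targets = set(natural) | set(pending)
    live_acks = len(targets & set(live))
    return block_for, live_acks, live_acks >= block_for

live = {"n1", "n2", "n4", "n5"}      # node 3 was killed
natural = ["n3", "n4", "n5"]         # assumed replicas for node 3's range

print(check_write(natural, ["n3"], live))  # pending as currently calculated
print(check_write(natural, ["n1"], live))  # pending as it should be
```

With node 3 as the pending endpoint, 3 acks are required but only 2 distinct
live targets exist, so the write cannot succeed. With node 1 as the pending
endpoint, 3 live targets are available.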
The correct pending range for this scenario is that node 1 gains the range
(-1844674407370955160, 1844674407370955160). So node 1, not node 3, should be
added as a destination for writes and CAS for this range.
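
As a cross-check of that expected result, here is a small Python sketch that
compares replica sets before and after the move for the ring above. It is not
the PendingRangeCalculator implementation. It assumes one token per node and
SimpleStrategy-style placement (the owner of a token plus the next RF-1 nodes
clockwise). The function names replicas and pending_for_move are hypothetical.

```python
from bisect import bisect_left

def replicas(ring, token, rf):
    # ring maps node -> token. The first node whose token is >= `token`
    # owns it, followed by the next rf-1 nodes clockwise (assumed placement).
    ordered = sorted(ring.items(), key=lambda kv: kv[1])
    tokens = [t for _, t in ordered]
    start = bisect_left(tokens, token) % len(ordered)
    return [ordered[(start + i) % len(ordered)][0] for i in range(rf)]

def pending_for_move(ring, node, new_token, rf):
    # Nodes that become replicas of a range only after the move are pending.
    after = {**ring, node: new_token}
    bounds = sorted(set(ring.values()) | {new_token})
    pending = {}
    for i, right in enumerate(bounds):
        left = bounds[i - 1]  # index -1 wraps around the ring for i == 0
        gained = set(replicas(after, right, rf)) - set(replicas(ring, right, rf))
        if gained:
            pending[(left, right)] = gained
    return pending

ring = {
    "127.0.0.1": -9223372036854775808,
    "127.0.0.2": -5534023222112865485,
    "127.0.0.3": 1844674407370955160,
    "127.0.0.4": 1844674407370955161,
    "127.0.0.5": 5534023222112865484,
}
print(pending_for_move(ring, "127.0.0.3", -1844674407370955160, rf=3))
```

Under these assumptions the only pending entry is 127.0.0.1 for the range
between -1844674407370955160 and 1844674407370955160.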
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)