Liu Cao created CASSANALYTICS-79:
------------------------------------

             Summary: Make Token ranges calculation rack aware for 
spark.data.partitioner.CassandraRing
                 Key: CASSANALYTICS-79
                 URL: https://issues.apache.org/jira/browse/CASSANALYTICS-79
             Project: Apache Cassandra Analytics
          Issue Type: Improvement
            Reporter: Liu Cao


Discussion context: 
[https://github.com/apache/cassandra-analytics/pull/93#issuecomment-3099771451]

 

Per the [comment|#L61], the token range calculation used by the Cassandra 
Analytics bulk read data source is not rack aware. As a result, the wrong 
replicas can be picked when reading SSTables.

 

The calculation logic is 
[here|https://github.com/apache/cassandra-analytics/blob/6e1d5257a8d6c910a42751475612145533ae3b1d/cassandra-analytics-common/src/main/java/org/apache/cassandra/spark/utils/RangeUtils.java#L158].

 

Concrete example of breakage:

We have a 6-node cluster with 16 vnodes per node, which yields a list of 96 
instances.
These 6 nodes reside in 3 different Availability Zones of the same AWS region 
(2 nodes in each AZ). In the gist below, node UUIDs are replaced with 
human-readable identifiers for clarity: us-west-2a-node1, us-west-2a-node2, etc.

The replication factor for the keyspace is 3 and we use NetworkTopologyStrategy.

[https://gist.github.com/liucao-dd/542eb0868d2e080733ca3fe127719114]

 

Entry at index 1: {{{"token"="-9067202222264017285", 
"node"="us-west-2b-node1", "dc"="us-west-2"}}}
(1 + 96 - 3) % 96 = 94, so we look at the entry at index 94: 
{{{"token"="8821921454609098249", "node"="us-west-2b-node2", 
"dc"="us-west-2"}}}. The calculation therefore concludes that this vnode on 
{{us-west-2b-node1}} holds replica data for the token ranges 
{{(8821921454609098249, MAX]}} and {{(MIN, -9067202222264017285)}}.

Entry at index 2: {{{"token"="-8862464739686088316", 
"node"="us-west-2b-node2", "dc"="us-west-2"}}}
(2 + 96 - 3) % 96 = 95, so we look at the entry at index 95: 
{{{"token"="8957072497331100633", "node"="us-west-2c-node1", 
"dc"="us-west-2"}}}. The calculation therefore concludes that this vnode on 
{{us-west-2b-node2}} holds replica data for the token ranges 
{{(8957072497331100633, MAX]}} and {{(MIN, -8862464739686088316)}}.
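
The index arithmetic in the two entries above can be sketched as follows. This is only an illustration of the wrap-around lookup described in this ticket, not the actual RangeUtils code; the function name {{range_start_index}} is hypothetical.

```python
def range_start_index(i: int, ring_size: int, rf: int) -> int:
    """Index of the ring entry whose token opens the widest range that the
    entry at index i is assumed to replicate, when replicas are taken as
    consecutive ring entries regardless of rack."""
    return (i + ring_size - rf) % ring_size


RING_SIZE = 96  # 6 nodes x 16 vnodes, as in the example
RF = 3

for i in (1, 2):
    # Prints "1 -> 94" and "2 -> 95", matching the entries above.
    print(i, "->", range_start_index(i, RING_SIZE, RF))
```

Nothing in this lookup consults the rack of the entries it steps over, which is where the problem described below comes from.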

Now look at the token range {{(8957072497331100633, MAX]}} specifically. This 
token range should be replicated on 1 node in each AZ. However, according to 
the calculation above, both {{us-west-2b-node1}} and {{us-west-2b-node2}}, 
which are in the same AZ (rack), hold replica data for it. This contradicts 
rack-aware replica placement, where us-west-2a and us-west-2c would each hold 
1 replica for this token range (RF=3 with 3 AZs/racks). In reality, 
us-west-2b-node2 does not hold replica data for it.
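
A quick way to spot this kind of violation is to count replicas per rack for a given range. The sketch below assumes the rack can be derived from the human-readable identifiers used in this ticket ({{<az>-node<N>}}); both helper names are hypothetical, and only the two replicas named above are listed.

```python
from collections import Counter


def rack_of(node_id: str) -> str:
    # Assumption: identifiers follow the "<az>-node<N>" naming used above.
    return node_id.rsplit("-node", 1)[0]


def racks_with_repeated_replicas(replicas):
    """Return the racks that hold more than one replica of the same range."""
    counts = Counter(rack_of(node) for node in replicas)
    return sorted(rack for rack, count in counts.items() if count > 1)


# Two of the replicas the current calculation assigns to (8957072497331100633, MAX]
claimed = ["us-west-2b-node1", "us-west-2b-node2"]
print(racks_with_repeated_replicas(claimed))  # ['us-west-2b']
```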

 

This issue is not limited to vnodes; it is a general problem whenever replica 
placement on the token ring is rack aware.

 

One solution is to use the sidecar's {{/api/v1/token-range-replicas}} 
endpoint.

 

Alternatively, since we already call the ring endpoint for the keyspace and its 
response already includes the rack information for each node, we can consider 
adding an optional new attribute to the class CassandraInstance and calculating 
the token ranges according to the replication strategy, following standard 
Cassandra logic such as that in 
[https://github.com/datastax/python-driver/blob/0979b897549de4578eda31dfd9e1e1a2f080c926/cassandra/metadata.py#L581]
 or 
[https://github.com/apache/cassandra-java-driver/blob/17ebe6092e2877d8c524e07489c4c3d005cfeea5/core/src/main/java/com/datastax/oss/driver/internal/core/metadata/token/NetworkTopologyReplicationStrategy.java#L59]
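
To illustrate the difference, here is a minimal single-datacenter sketch of a rack-aware ring walk next to a rack-unaware one. It is a simplification in the spirit of NetworkTopologyStrategy, not the code from either driver or from cassandra-analytics. The {{RingEntry}} type, the function names, and the ring (one token per node, with made-up tokens) are all hypothetical; only the node naming is borrowed from the example above.

```python
from bisect import bisect_left
from typing import NamedTuple


class RingEntry(NamedTuple):
    token: int
    node: str
    rack: str


def _sorted_ring_and_start(ring, token):
    # The owner of a token is the first entry with token >= it, wrapping around.
    ring = sorted(ring, key=lambda e: e.token)
    start = bisect_left([e.token for e in ring], token) % len(ring)
    return ring, start


def rack_unaware_replicas(ring, token, rf):
    """Take the next rf distinct nodes clockwise, ignoring racks."""
    ring, start = _sorted_ring_and_start(ring, token)
    replicas = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)].node
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == rf:
            break
    return replicas


def rack_aware_replicas(ring, token, rf):
    """Walk clockwise, preferring unseen racks; a rack may repeat only when
    rf exceeds the number of racks (simplified, single datacenter)."""
    ring, start = _sorted_ring_and_start(ring, token)
    remaining = min(rf, len({e.node for e in ring}))
    allowed_rack_repeats = max(rf - len({e.rack for e in ring}), 0)
    replicas, seen_racks = [], set()
    for step in range(len(ring)):
        if remaining == 0:
            break
        entry = ring[(start + step) % len(ring)]
        if entry.node in replicas:
            continue
        if entry.rack in seen_racks:
            if allowed_rack_repeats <= 0:
                continue  # skip: this rack already holds a replica
            allowed_rack_repeats -= 1
        seen_racks.add(entry.rack)
        replicas.append(entry.node)
        remaining -= 1
    return replicas


# Hypothetical ring: tokens are invented for illustration only.
RING = [
    RingEntry(0, "us-west-2b-node1", "us-west-2b"),
    RingEntry(10, "us-west-2b-node2", "us-west-2b"),
    RingEntry(20, "us-west-2c-node1", "us-west-2c"),
    RingEntry(30, "us-west-2a-node1", "us-west-2a"),
    RingEntry(40, "us-west-2c-node2", "us-west-2c"),
    RingEntry(50, "us-west-2a-node2", "us-west-2a"),
]

print(rack_unaware_replicas(RING, 0, 3))
# ['us-west-2b-node1', 'us-west-2b-node2', 'us-west-2c-node1']
print(rack_aware_replicas(RING, 0, 3))
# ['us-west-2b-node1', 'us-west-2c-node1', 'us-west-2a-node1']
```

With RF=3 and 3 racks, the rack-aware walk skips {{us-west-2b-node2}} because its rack already holds a replica, which matches the placement described in this ticket.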


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
