[ https://issues.apache.org/jira/browse/CASSANALYTICS-79?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Liu Cao updated CASSANALYTICS-79:
---------------------------------
Description:

Discussion context: [https://github.com/apache/cassandra-analytics/pull/93#issuecomment-3099771451]

Per the [comment|#L61], the token range calculation used in the Cassandra Analytics bulk read data source is not rack aware. This leads to an incorrect choice of replication targets when reading SSTables. The calculation logic is [here|https://github.com/apache/cassandra-analytics/blob/6e1d5257a8d6c910a42751475612145533ae3b1d/cassandra-analytics-common/src/main/java/org/apache/cassandra/spark/utils/RangeUtils.java#L158].

Concrete example of the breakage:

We have a cluster of 6 nodes with 16 vnodes each, yielding a token ring of 96 entries. The 6 nodes reside in 3 different Availability Zones of the same AWS region (2 nodes per AZ). In the gist below, node UUIDs are replaced with human-readable identifiers for clarity - us-west-2a-node1, us-west-2a-node2, etc. The keyspace uses NetworkTopologyStrategy with a replication factor of 3.

[https://gist.github.com/liucao-dd/542eb0868d2e080733ca3fe127719114]

Entry 1 (1-indexed):
{code:java}
{"token"="-9067202222264017285", "node"="us-west-2b-node1", "dc"="us-west-2"}{code}
Since (1 + 96 - 3) % 96 = 94, look at entry 94:
{code:java}
{"token"="8821921454609098249", "node"="us-west-2b-node2", "dc"="us-west-2"}{code}
This indicates the vnode on {{us-west-2b-node1}} holds replica data for the token ranges {{(8821921454609098249, MAX]}} and {{(MIN, -9067202222264017285]}}.

Entry 2:
{code:java}
{"token"="-8862464739686088316", "node"="us-west-2b-node2", "dc"="us-west-2"}{code}
Since (2 + 96 - 3) % 96 = 95, look at entry 95:
{code:java}
{"token"="8957072497331100633", "node"="us-west-2c-node1", "dc"="us-west-2"}{code}
This indicates the vnode on {{us-west-2b-node2}} holds replica data for the token ranges {{(8957072497331100633, MAX]}} and {{(MIN, -8862464739686088316]}}.

Now look at the token range {{(8957072497331100633, MAX]}} specifically. It should be replicated to one node in each AZ. Under the current calculation, however, {{us-west-2b-node1}} and {{us-west-2b-node2}}, which sit in the same AZ (rack), are both treated as replicas for it. This contradicts the rack-aware token placement, where us-west-2a and us-west-2c would each hold one replica of this range (RF=3 with 3 AZs/racks). In reality, us-west-2b-node2 does not hold replica data for it.

This issue is not limited to vnodes; it is a generic problem whenever token placement is rack aware.

One solution is to use the sidecar's {{/api/v1/token-range-replicas}} endpoint.

Alternatively, since we already call the ring endpoint {{api/v1/cassandra/ring}} and its response already includes the rack of each node, we could add an optional attribute to the CassandraInstance class and compute the token ranges according to the replication strategy, following the standard Cassandra logic implemented in
[https://github.com/datastax/python-driver/blob/0979b897549de4578eda31dfd9e1e1a2f080c926/cassandra/metadata.py#L581]
and
[https://github.com/apache/cassandra-java-driver/blob/17ebe6092e2877d8c524e07489c4c3d005cfeea5/core/src/main/java/com/datastax/oss/driver/internal/core/metadata/token/NetworkTopologyReplicationStrategy.java#L59]
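
To make the failure mode above concrete in code, here is a minimal sketch of the naive "next RF owners clockwise" derivation. All names and the 4-entry toy ring are illustrative (the 96th entry's token and owner are not in the gist excerpt, so placeholders are used); this is not the actual RangeUtils implementation. The point is that rack is never consulted, so two nodes in the same AZ can both be selected:
{code:java}
import java.util.ArrayList;
import java.util.List;

public final class NaiveReplicaWalk
{
    record Entry(long token, String node) {}

    // Replicas of the range closed by ring entry `index`: the owners of the
    // next RF distinct-node entries clockwise. Rack is never consulted.
    static List<String> replicasOfRangeClosedBy(List<Entry> ring, int index, int rf)
    {
        List<String> replicas = new ArrayList<>();
        for (int step = 0; replicas.size() < rf && step < ring.size(); step++)
        {
            String owner = ring.get((index + step) % ring.size()).node();
            if (!replicas.contains(owner))
            {
                replicas.add(owner);
            }
        }
        return replicas;
    }

    public static void main(String[] args)
    {
        // Toy 4-entry ring loosely modeled on the gist's tail; entry 96's
        // token and owner are unknown, so placeholders stand in for them.
        List<Entry> ring = List.of(
            new Entry(8957072497331100633L, "us-west-2c-node1"),    // entry 95
            new Entry(9100000000000000000L, "entry96-placeholder"), // entry 96
            new Entry(-9067202222264017285L, "us-west-2b-node1"),   // entry 1
            new Entry(-8862464739686088316L, "us-west-2b-node2"));  // entry 2
        // Prints [entry96-placeholder, us-west-2b-node1, us-west-2b-node2]:
        // two of the three "replicas" sit in the same rack, us-west-2b.
        System.out.println(replicasOfRangeClosedBy(ring, 1, 3));
    }
}{code}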
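
For the second option, the rack-aware placement implemented by the linked drivers reduces, for a single DC, to roughly the following sketch (hypothetical names; multi-DC handling and per-DC replication factors are omitted, so treat it as an outline of the technique rather than a drop-in implementation):
{code:java}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public final class RackAwareReplicaWalk
{
    record Entry(long token, String node, String rack) {}

    // Single-DC NetworkTopologyStrategy-style walk: take the first node of
    // each not-yet-represented rack; hold back nodes of already-represented
    // racks until every rack holds a replica, then let them fill the rest.
    static List<String> replicasOfRangeClosedBy(List<Entry> ring, int index, int rf, int rackCount)
    {
        List<String> replicas = new ArrayList<>();
        Set<String> seenRacks = new HashSet<>();
        List<String> heldBack = new ArrayList<>();
        for (int step = 0; replicas.size() < rf && step < ring.size(); step++)
        {
            Entry e = ring.get((index + step) % ring.size());
            if (replicas.contains(e.node()) || heldBack.contains(e.node()))
            {
                continue; // another vnode of a node we already classified
            }
            if (seenRacks.add(e.rack()))
            {
                replicas.add(e.node()); // first node seen in this rack
                if (seenRacks.size() == rackCount)
                {
                    // every rack is now represented; held-back nodes may
                    // fill the remaining slots in the order they appeared
                    for (String node : heldBack)
                    {
                        if (replicas.size() < rf)
                        {
                            replicas.add(node);
                        }
                    }
                }
            }
            else if (seenRacks.size() == rackCount)
            {
                replicas.add(e.node()); // all racks already represented
            }
            else
            {
                heldBack.add(e.node()); // rack already represented, hold back
            }
        }
        return replicas;
    }

    public static void main(String[] args)
    {
        List<Entry> ring = List.of(
            new Entry(100L, "us-west-2c-node1", "us-west-2c"),
            new Entry(200L, "us-west-2b-node1", "us-west-2b"),
            new Entry(300L, "us-west-2b-node2", "us-west-2b"),
            new Entry(400L, "us-west-2a-node1", "us-west-2a"));
        // Prints [us-west-2c-node1, us-west-2b-node1, us-west-2a-node1]:
        // us-west-2b-node2 is held back because its rack already has a replica.
        System.out.println(replicasOfRangeClosedBy(ring, 0, 3, 3));
    }
}{code}
Applied to the example above, this walk would not count {{us-west-2b-node2}} as a replica of {{(8957072497331100633, MAX]}}, because rack us-west-2b already holds a replica of that range - matching the actual placement.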
> Make Token ranges calculation rack aware for
> spark.data.partitioner.CassandraRing
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANALYTICS-79
>                 URL: https://issues.apache.org/jira/browse/CASSANALYTICS-79
>             Project: Apache Cassandra Analytics
>          Issue Type: Improvement
>            Reporter: Liu Cao
>            Priority: Normal