[ 
https://issues.apache.org/jira/browse/CASSANALYTICS-79?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liu Cao updated CASSANALYTICS-79:
---------------------------------
    Description: 
Discussion context:
[https://github.com/apache/cassandra-analytics/pull/93#issuecomment-3099771451]

 

Per the [comment|#L61], the token range calculation used in the Cassandra Analytics 
bulk read data source is not rack aware. This leads to an incorrect choice of 
replication targets when reading SSTables.

 

The calculation logic is 
[here|https://github.com/apache/cassandra-analytics/blob/6e1d5257a8d6c910a42751475612145533ae3b1d/cassandra-analytics-common/src/main/java/org/apache/cassandra/spark/utils/RangeUtils.java#L158].
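
For illustration, here is a minimal sketch of the non-rack-aware calculation described above (not the actual RangeUtils code; the class and method names are hypothetical): the ring is treated as a flat list of (token, node) entries sorted by token, and the range replicated by the entry at position i is assumed to start RF positions earlier on the ring, regardless of which racks the intervening entries belong to.

{code:java}
// Hypothetical sketch of the current, non-rack-aware range assignment.
// The ring is a flat list of (token, node, rack) entries sorted by token;
// the entry at position i is assumed to replicate the range that starts
// RF positions earlier on the ring, ignoring the racks in between.
import java.math.BigInteger;
import java.util.List;

final class NaiveRangeSketch
{
    record RingEntry(BigInteger token, String node, String rack) {}

    /**
     * Returns {start, end} of the (start, end] range assumed to be replicated
     * by ring.get(i).node(); the range wraps around MIN/MAX when start > end.
     */
    static BigInteger[] replicatedRange(List<RingEntry> ring, int i, int rf)
    {
        int n = ring.size();
        int startIndex = (i + n - rf) % n;   // e.g. (1 + 96 - 3) % 96 = 94
        return new BigInteger[] { ring.get(startIndex).token(), ring.get(i).token() };
    }
}
{code}

With the 96-entry ring from the gist, index 1 maps to start index (1 + 96 - 3) % 96 = 94 and index 2 maps to 95, which is exactly the arithmetic walked through in the concrete example below.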

 

Concrete example of breakage:

We have a cluster of 6 nodes, each with vnode=16, thereby creating a list of 96 
instances (one ring entry per vnode).
These 6 nodes reside in 3 different Availability Zones of the same AWS region 
(2 nodes in each AZ). In the gist below, node UUIDs are replaced with 
human-readable identifiers for clarity: us-west-2a-node1, us-west-2a-node2, etc.

The replication factor for the keyspace is 3 and we use NetworkTopologyStrategy.

[https://gist.github.com/liucao-dd/542eb0868d2e080733ca3fe127719114]

 

Entry at index 1: \{"token"="-9067202222264017285", "node"="us-west-2b-node1", 
"dc"="us-west-2"}
(1 + 96 - 3) % 96 = 94, and the entry at index 94 is 
\{"token"="8821921454609098249", "node"="us-west-2b-node2", "dc"="us-west-2"}, 
indicating that this vnode on {{us-west-2b-node1}} holds replica data for the 
token ranges {{(8821921454609098249, MAX]}} and {{(MIN, -9067202222264017285)}}.

Entry at index 2: \{"token"="-8862464739686088316", "node"="us-west-2b-node2", 
"dc"="us-west-2"}
(2 + 96 - 3) % 96 = 95, and the entry at index 95 is 
\{"token"="8957072497331100633", "node"="us-west-2c-node1", "dc"="us-west-2"}, 
indicating that this vnode on {{us-west-2b-node2}} holds replica data for the 
token ranges {{(8957072497331100633, MAX]}} and {{(MIN, -8862464739686088316)}}.

Now consider the token range {{(8957072497331100633, MAX]}} specifically. This 
range should be replicated on one node in each AZ. However, the calculation above 
assigns it to both {{us-west-2b-node1}} and {{us-west-2b-node2}}, which are in the 
same AZ (rack). This contradicts rack-aware token placement, where us-west-2a and 
us-west-2c would each hold one replica of this range (RF=3 with 3 AZs/racks). In 
reality, us-west-2b-node2 does not hold replica data for it.

 

This issue is not limited to vnodes; it is a general problem whenever token 
placement on the ring is rack aware.

 

One solution is to use the Sidecar's {{/api/v1/token-range-replicas}} endpoint.

 

Alternatively, given that we already call the ring endpoint 
{{api/v1/cassandra/ring}} and its response already includes the rack of each node, 
we could add an optional new attribute to the CassandraInstance class and calculate 
the token ranges according to the replication strategy, following standard 
Cassandra logic such as that in 
[https://github.com/datastax/python-driver/blob/0979b897549de4578eda31dfd9e1e1a2f080c926/cassandra/metadata.py#L581]
or 
[https://github.com/apache/cassandra-java-driver/blob/17ebe6092e2877d8c524e07489c4c3d005cfeea5/core/src/main/java/com/datastax/oss/driver/internal/core/metadata/token/NetworkTopologyReplicationStrategy.java#L59]
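
As an illustration of that second option, below is a minimal, single-datacenter sketch of the rack-aware walk, modelled loosely on the driver implementations linked above. It is not a drop-in for the analytics code, and all names here (NtsReplicaSketch, replicasForToken, etc.) are hypothetical.

{code:java}
// Hypothetical, simplified sketch of rack-aware replica selection for a single
// datacenter (NetworkTopologyStrategy-style). Walk the ring clockwise from the
// position owning the token; take at most one replica per rack until every rack
// has contributed (or RF is reached), then fill remaining slots from the nodes
// that were skipped because their rack was already represented.
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class NtsReplicaSketch
{
    record RingEntry(BigInteger token, String node, String rack) {}

    static List<String> replicasForToken(List<RingEntry> ring, BigInteger token, int rf)
    {
        int n = ring.size();
        long rackCount = ring.stream().map(RingEntry::rack).distinct().count();

        // First entry whose token is >= the queried token (wraps to 0 past the end).
        int start = 0;
        while (start < n && ring.get(start).token().compareTo(token) < 0)
        {
            start++;
        }
        start = start % n;

        Set<String> replicas = new LinkedHashSet<>();
        Set<String> seenRacks = new HashSet<>();
        List<String> skipped = new ArrayList<>();

        for (int i = 0; i < n && replicas.size() < rf; i++)
        {
            RingEntry e = ring.get((start + i) % n);
            if (replicas.contains(e.node()))
            {
                continue;                     // same node owns several vnodes
            }
            if (seenRacks.add(e.rack()) || seenRacks.size() == rackCount)
            {
                replicas.add(e.node());       // new rack, or all racks already represented
            }
            else
            {
                skipped.add(e.node());        // rack already has a replica; hold back
            }
        }
        for (String node : skipped)
        {
            if (replicas.size() >= rf) break;
            replicas.add(node);               // fill remaining slots if racks < RF
        }
        return new ArrayList<>(replicas);
    }
}
{code}

For the range {{(8957072497331100633, MAX]}} in the example above, a walk like this with RF=3 and 3 racks would pick one node from each of us-west-2a, us-west-2b, and us-west-2c, so {{us-west-2b-node2}} would not be selected as a replica.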

 

 

 

 

 


> Make Token ranges calculation rack aware for 
> spark.data.partitioner.CassandraRing
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANALYTICS-79
>                 URL: https://issues.apache.org/jira/browse/CASSANALYTICS-79
>             Project: Apache Cassandra Analytics
>          Issue Type: Improvement
>            Reporter: Liu Cao
>            Priority: Normal
>


