[ 
https://issues.apache.org/jira/browse/CASSANDRA-21450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daouda Sarr updated CASSANDRA-21450:
------------------------------------
    Attachment: processing-time-in-4.1.3.png
                processing-time-in-4.1.11.png
                processing-time-after-rollback-in-4.1.3.png

> Performance regression (read quorum) after upgrade from Cassandra 4.1.3 to 
> 4.1.11
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-21450
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21450
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination, Consistency/Repair
>            Reporter: Daouda Sarr
>            Priority: Normal
>         Attachments: processing-time-after-rollback-in-4.1.3.png, 
> processing-time-in-4.1.11.png, processing-time-in-4.1.3.png
>
>
>  Apache Cassandra version
> - Baseline version: 4.1.3
> - Regressed version: 4.1.11
>  
> Cluster topology
>  
> The cluster is deployed across three datacenters:
>  
> Datacenter | Nodes | Replication Factor | Location
> -----------|-------|--------------------|---------
> DC1        | 10    | RF=2               | Southern site
> DC2        | 10    | RF=2               | Southern site (close proximity to DC1)
> DC3        | 5     | RF=1               | Northern site (significantly farther from DC1/DC2)
>  
> Keyspace replication strategy:
> - DC1 = 2
> - DC2 = 2
> - DC3 = 1
>  
> Total replication factor = 5.
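
The replication settings above determine how many replica responses a QUORUM read needs. As a rough sketch (the helper name `quorum` and the `rf_by_dc` dictionary are illustrative, not taken from the report), QUORUM is a majority of the summed replication factors, so with RF=5 it is 3, and the 4 replicas in DC1/DC2 are in principle enough without any response from DC3:

```python
def quorum(total_rf: int) -> int:
    """Majority of replicas: floor(total_rf / 2) + 1."""
    return total_rf // 2 + 1


# Per-datacenter replication factors as listed in the report.
rf_by_dc = {"DC1": 2, "DC2": 2, "DC3": 1}

total_rf = sum(rf_by_dc.values())
required = quorum(total_rf)

# Replicas located in the two nearby datacenters only.
near_replicas = rf_by_dc["DC1"] + rf_by_dc["DC2"]
quorum_without_dc3 = near_replicas >= required

print(f"total RF={total_rf}, QUORUM needs {required}, "
      f"DC1+DC2 replicas={near_replicas}, "
      f"satisfiable without DC3: {quorum_without_dc3}")
```

This only covers the arithmetic; it says nothing about which replicas the coordinator actually contacts for a given read.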
>  
> Applications are deployed only in DC1 and DC2.  
> Client drivers use contact points exclusively from DC1 and DC2. No 
> application traffic originates from DC3.
>  
> The issue is observed from application workloads running in DC1/DC2.
>  
> The cluster topology, replication settings, application deployment, and 
> client configuration remained unchanged throughout all tests.
>  
> The only variable changed between tests is the Cassandra version:
> - 4.1.3 → normal performance
> - 4.1.11 → performance degradation
> - rollback to 4.1.3 → performance restored
> - upgrade again to 4.1.11 → degradation reproduced
>  
> Description:
> After upgrading from Cassandra 4.1.3 to 4.1.11, we observed a significant 
> increase in latency for read queries.
>  
> The affected queries are generated by a batch/scanning workload with CL=QUORUM.
>  
> After upgrading to 4.1.11, Cassandra frequently logs slow cross-node reads.
> It seems that Cassandra waits for something in DC3 even though QUORUM can be
> achieved with the 4 copies in DC1/DC2:
>  
> slow timeout 500 msec/cross-node with observed latencies typically between 
> 530 ms and 780 ms.
>  
> Example: "was slow 2 times: avg/min/max 761/741/780 msec"
>  
> Thousands of similar messages are generated:
> ... (6149 were dropped)
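
To compare slow-read summaries before and after the upgrade, the counts and latencies can be extracted from the log. The sketch below is an assumption-laden example: the regular expression is written only against the fragment quoted above, since the full log line format is not shown in the report, and the function name `parse_slow_summary` is hypothetical.

```python
import re

# Pattern matched only against the fragment quoted in the report;
# the surrounding log line layout is not assumed.
SLOW_RE = re.compile(
    r"was slow (\d+) times: avg/min/max (\d+)/(\d+)/(\d+) msec"
)


def parse_slow_summary(line):
    """Return count and avg/min/max latencies (ms), or None if no match."""
    match = SLOW_RE.search(line)
    if match is None:
        return None
    count, avg_ms, min_ms, max_ms = (int(g) for g in match.groups())
    return {"count": count, "avg_ms": avg_ms,
            "min_ms": min_ms, "max_ms": max_ms}


sample = "was slow 2 times: avg/min/max 761/741/780 msec"
parsed = parse_slow_summary(sample)
print(parsed)
```

Aggregating these records per node and per version would show whether the slow reads cluster around particular coordinators or replicas.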
>  
>  
> The table is very small:
> - Estimated partitions: 814
> - SSTable count: 3
> - Live size: ~112 KB
> - No compaction pressure
> - No dropped mutations
> - Very low memory footprint
>  
> Observed metrics on Cassandra 4.1.11
>  
> nodetool proxyhistograms:
> - Read latency P99: ~8 ms
> - Range latency P99: ~183 ms
> - Maximum range latency: ~263 ms
>  
> nodetool tablestats:
> - Local read latency: ~0.150 ms
> - Local write latency: ~0.015 ms
> - Bloom filter false ratio: 0.00000
> - SSTable count: 3
>  
> Additional observations
>  
> - The issue does not appear related to table size, compaction, SSTables, or 
> tombstones.
> - Local read performance remains excellent.
> - The regression is observed specifically after upgrading to 4.1.11.
> - The issue disappears immediately after rolling back to 4.1.3.
>  
> Performance comparison
>  
> We have collected performance measurements and monitoring graphs for both 
> versions under comparable production workloads.
>  
> Observations:
> - 4.1.3: stable baseline, no slow query alerts
> - 4.1.11: significant increase in token range scan latency and slow query logs
> - rollback to 4.1.3: performance returns to baseline
> - re-upgrade to 4.1.11: degradation reproduced
>  
> We also have monitoring graphs showing:
> - latency (P50 / P95 / P99)
> - rate of slow queries
> - throughput (ops/sec)
> - clear correlation with Cassandra version changes
>  
> The degradation is reproducible and follows the Cassandra version changes 
> exactly. Returning to 4.1.3 restores the original performance without any 
> other infrastructure, application, schema, or configuration changes.
>  
> Expected behavior:
>  
> Performance of read queries should remain consistent between Cassandra 4.1.3 
> and 4.1.11 for identical workload and cluster topology.
>  
> Questions
>  
> 1. Are there known changes between 4.1.3 and 4.1.11 affecting:
>    - token range scan execution
>    - cross-node read coordination
>    - QUORUM read behavior in multi-DC setups
>    - speculative retry interactions with range reads
>  
> 2. Is this a known regression?
>  
> 3. Are there specific diagnostics (tracing, JMX metrics) that could help 
> identify the root cause?
>  
> We can provide:
> - full logs (before/after upgrade)
> - tracing outputs
> - performance graphs
> - cluster configuration snapshots



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

