[ 
https://issues.apache.org/jira/browse/CASSANDRA-21450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daouda Sarr updated CASSANDRA-21450:
------------------------------------
    Attachment: processing-time-after-upgrade-in-4.1.11.png

> Performance regression (read quorum) after upgrade from Cassandra 4.1.3 to 
> 4.1.11
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-21450
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21450
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination, Consistency/Repair
>            Reporter: Daouda Sarr
>            Priority: Normal
>         Attachments: processing-time-after-rollback-in-4.1.3.png, 
> processing-time-after-upgrade-in-4.1.11.png, processing-time-in-4.1.11.png, 
> processing-time-in-4.1.3.png
>
>
>  Apache Cassandra version
>  - Baseline version: 4.1.3
>  - Regressed version: 4.1.11
>  
> Cluster topology
>  
> The cluster is deployed across three datacenters:
>  
> Datacenter | Nodes | Replication Factor | Location
> -----------|-------|--------------------|---------
> DC1        | 10    | RF=2               | Southern site
> DC2        | 10    | RF=2               | Southern site (close proximity to DC1)
> DC3        | 5     | RF=1               | Northern site (significantly farther from DC1/DC2)
>  
> Keyspace replication strategy:
>  - DC1 = 2
>  - DC2 = 2
>  - DC3 = 1
>  
> Total replication factor = 5.
>  
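The replication settings above imply a quorum size that can be checked with a little arithmetic. The sketch below assumes the usual Cassandra definition of QUORUM as a majority of the summed replication factors across all datacenters; the variable names are illustrative only.

```python
# Replication factors per datacenter, as listed in the issue.
replication = {"DC1": 2, "DC2": 2, "DC3": 1}

total_rf = sum(replication.values())
# QUORUM is a majority of all replicas: floor(total_rf / 2) + 1.
quorum = total_rf // 2 + 1

# Replicas hosted in the two nearby datacenters.
nearby_replicas = replication["DC1"] + replication["DC2"]

print(f"total RF = {total_rf}, QUORUM = {quorum}")
print(f"DC1+DC2 replicas = {nearby_replicas}, "
      f"enough for QUORUM: {nearby_replicas >= quorum}")
```

With these values QUORUM needs 3 of 5 replicas, so the 4 replicas in DC1/DC2 are sufficient on paper and a response from DC3 should not be required.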
> Applications are deployed only in DC1 and DC2.  
> Client drivers use contact points exclusively from DC1 and DC2. No 
> application traffic originates from DC3.
>  
> The issue is observed from application workloads running in DC1/DC2.
>  
> The cluster topology, replication settings, application deployment, and 
> client configuration remained unchanged throughout all tests.
>  
> The only variable changed between tests is the Cassandra version:
>  - 4.1.3 → normal performance
>  - 4.1.11 → performance degradation
>  - rollback to 4.1.3 → performance restored
>  - upgrade again to 4.1.11 → degradation reproduced
>  
> Description:
> After upgrading from Cassandra 4.1.3 to 4.1.11, we observed a significant 
> increase in latency for read queries.
>  
> The affected queries are generated by a batch/scanning workload with CL=QUORUM.
>  
> After upgrading to 4.1.11, Cassandra frequently logs slow cross-node reads. 
> Cassandra seems to wait for something in DC3 even though QUORUM can be 
> achieved with the 4 replicas in DC1/DC2:
>  
> The log entries report "slow timeout 500 msec/cross-node", with observed 
> latencies typically between 530 ms and 780 ms.
>  
> Example: "was slow 2 times: avg/min/max 761/741/780 msec"
>  
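To quantify messages like the example above, the aggregated figures can be pulled out with a regular expression. This is a minimal sketch that matches only the fragment quoted in this report; the full log line layout is not shown here, and the `SLOW_RE` pattern, `SlowSummary` type, and `parse_slow_summary` helper are hypothetical names introduced for illustration, not part of Cassandra.

```python
import re
from typing import NamedTuple, Optional


class SlowSummary(NamedTuple):
    count: int
    avg_ms: int
    min_ms: int
    max_ms: int


# Matches only the fragment quoted in the report (assumed stable wording).
SLOW_RE = re.compile(r"was slow (\d+) times: avg/min/max (\d+)/(\d+)/(\d+) msec")

# "slow timeout 500 msec" as stated in the report.
SLOW_THRESHOLD_MS = 500


def parse_slow_summary(line: str) -> Optional[SlowSummary]:
    """Return the counters from a slow-query summary, or None if absent."""
    match = SLOW_RE.search(line)
    if match is None:
        return None
    return SlowSummary(*(int(group) for group in match.groups()))


summary = parse_slow_summary("was slow 2 times: avg/min/max 761/741/780 msec")
if summary is not None:
    excess = summary.avg_ms - SLOW_THRESHOLD_MS
    print(summary, f"average exceeds the slow threshold by {excess} ms")
```

Running such a helper over the logs from both versions would give a per-version count and latency distribution of slow reads to attach alongside the graphs.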
> Thousands of similar messages are generated:
> ... (6149 were dropped)
>  
>  
> The table is very small:
>  - Estimated partitions: 814
>  - SSTable count: 3
>  - Live size: ~112 KB
>  - No compaction pressure
>  - No dropped mutations
>  - Very low memory footprint
>  
> Observed metrics on Cassandra 4.1.11
>  
> nodetool proxyhistograms:
>  - Read latency P99: ~8 ms
>  - Range latency P99: ~183 ms
>  - Maximum range latency: ~263 ms
>  
> nodetool tablestats:
>  - Local read latency: ~0.150 ms
>  - Local write latency: ~0.015 ms
>  - Bloom filter false ratio: 0.00000
>  - SSTable count: 3
>  
> Additional observations
>  
>  - The issue does not appear related to table size, compaction, SSTables, or 
> tombstones.
>  - Local read performance remains excellent.
>  - The regression is observed specifically after upgrading to 4.1.11.
>  - The issue disappears immediately after rolling back to 4.1.3.
>  
> Performance comparison
>  
> We have collected performance measurements and monitoring graphs for both 
> versions under comparable production workloads.
>  
> Observations:
>  - 4.1.3: stable baseline, no slow query alerts
>  - 4.1.11: significant increase in token range scan latency and slow query 
> logs
>  - rollback to 4.1.3: performance returns to baseline
>  - re-upgrade to 4.1.11: degradation reproduced
>  
> We also have monitoring graphs showing:
>  - latency (P50 / P95 / P99)
>  - rate of slow queries
>  - throughput (ops/sec)
>  - clear correlation with Cassandra version changes
>  
> The degradation is reproducible and follows the Cassandra version changes 
> exactly. Returning to 4.1.3 restores the original performance without any 
> other infrastructure, application, schema, or configuration changes.
>  
> Expected behavior:
>  
> Performance of read queries should remain consistent between Cassandra 4.1.3 
> and 4.1.11 for identical workload and cluster topology.
>  
> Questions
>  
> 1. Are there known changes between 4.1.3 and 4.1.11 affecting:
>    - token range scan execution
>    - cross-node read coordination
>    - QUORUM read behavior in multi-DC setups
>    - speculative retry interactions with range reads
>  
> 2. Is this a known regression?
>  
> 3. Are there specific diagnostics (tracing, JMX metrics) that could help 
> identify the root cause?
>  
> We are trying to determine whether the observed behavior is:
>  - an unintended regression introduced in 4.1.11, or
>  - a deliberate change in behavior (performance vs correctness trade-off)
>  
> If this is an intentional change, could you confirm:
>  - whether it is expected to remain in future 4.1.x releases
>  - whether the same behavior is present in Cassandra 5.0
>  
> This information is important for us to decide whether we need to adapt our 
> architecture or whether this should be considered a regression.
>  
> We can provide:
>  - full logs (before/after upgrade)
>  - tracing outputs
>  - performance graphs
>  - cluster configuration snapshots


