[
https://issues.apache.org/jira/browse/CASSANDRA-21450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daouda Sarr updated CASSANDRA-21450:
------------------------------------
Attachment: processing-time-after-upgrade-in-4.1.11.png
> Performance regression (read quorum) after upgrade from Cassandra 4.1.3 to
> 4.1.11
> ---------------------------------------------------------------------------------
>
> Key: CASSANDRA-21450
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21450
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Consistency/Coordination, Consistency/Repair
> Reporter: Daouda Sarr
> Priority: Normal
> Attachments: processing-time-after-rollback-in-4.1.3.png,
> processing-time-after-upgrade-in-4.1.11.png, processing-time-in-4.1.11.png,
> processing-time-in-4.1.3.png
>
>
> Apache Cassandra version
> - Baseline version: 4.1.3
> - Regressed version: 4.1.11
>
> Cluster topology
>
> The cluster is deployed across three datacenters:
>
> Datacenter | Nodes | Replication Factor | Location
> -----------|-------|--------------------|---------
> DC1 | 10 | RF=2 | Southern site
> DC2 | 10 | RF=2 | Southern site (close proximity to DC1)
> DC3 | 5 | RF=1 | Northern site (significantly farther from DC1/DC2)
>
> Keyspace replication strategy:
> - DC1 = 2
> - DC2 = 2
> - DC3 = 1
>
> Total replication factor = 5.
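
To make the replica arithmetic explicit, here is a minimal sketch in plain Python (no driver involved). The `replication` dict and the `quorum` helper are illustrative names, not part of the report; the per-DC values are the ones listed above.

```python
# Per-datacenter replication factors as listed in this report.
replication = {"DC1": 2, "DC2": 2, "DC3": 1}


def quorum(total_rf: int) -> int:
    """QUORUM needs a majority of all replicas: floor(RF / 2) + 1."""
    return total_rf // 2 + 1


total_rf = sum(replication.values())               # replicas across all DCs
required = quorum(total_rf)                        # responses needed at QUORUM
nearby = replication["DC1"] + replication["DC2"]   # replicas close to the clients

print(f"total RF={total_rf}, QUORUM needs {required}, DC1+DC2 hold {nearby}")
```

With these values, QUORUM requires 3 responses while DC1 and DC2 together hold 4 replicas, so a QUORUM read can in principle be satisfied without any response from DC3.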
>
> Applications are deployed only in DC1 and DC2.
> Client drivers use contact points exclusively from DC1 and DC2. No
> application traffic originates from DC3.
>
> The issue is observed from application workloads running in DC1/DC2.
>
> The cluster topology, replication settings, application deployment, and
> client configuration remained unchanged throughout all tests.
>
> The only variable changed between tests is the Cassandra version:
> - 4.1.3 → normal performance
> - 4.1.11 → performance degradation
> - rollback to 4.1.3 → performance restored
> - upgrade again to 4.1.11 → degradation reproduced
>
> Description:
> After upgrading from Cassandra 4.1.3 to 4.1.11, we observed a significant
> increase in latency for read queries.
>
> The affected queries are issued by a batch/scanning workload at CL=QUORUM.
>
> After upgrading to 4.1.11, Cassandra frequently logs slow cross-node reads. It
> seems that Cassandra waits for something in DC3 even though QUORUM can be
> achieved with the 4 replicas in DC1/DC2:
>
> slow timeout 500 msec/cross-node with observed latencies typically between
> 530 ms and 780 ms.
>
> Example: "was slow 2 times: avg/min/max 761/741/780 msec"
>
> Thousands of similar messages are generated:
> ... (6149 were dropped)
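
For summarizing these messages in bulk, a small parser such as the following may help. It is only a sketch: the regular expression matches the aggregated fragment quoted above, the layout of the rest of the log line is not shown in this report and is therefore not assumed, and `parse_slow_fragment` is a hypothetical helper name.

```python
import re
from typing import Optional

# Matches only the aggregated fragment quoted in this report, e.g.
# "was slow 2 times: avg/min/max 761/741/780 msec".
SLOW_RE = re.compile(r"was slow (\d+) times: avg/min/max (\d+)/(\d+)/(\d+) msec")


def parse_slow_fragment(line: str) -> Optional[dict]:
    """Return count and avg/min/max latencies in ms, or None if no match."""
    m = SLOW_RE.search(line)
    if m is None:
        return None
    count, avg, lo, hi = map(int, m.groups())
    return {"count": count, "avg_ms": avg, "min_ms": lo, "max_ms": hi}


sample = "was slow 2 times: avg/min/max 761/741/780 msec"
print(parse_slow_fragment(sample))
```

Running it over the log before and after the upgrade would give comparable per-query counts and latency ranges for both versions.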
>
>
> The table is very small:
> - Estimated partitions: 814
> - SSTable count: 3
> - Live size: ~112 KB
> - No compaction pressure
> - No dropped mutations
> - Very low memory footprint
>
> Observed metrics on Cassandra 4.1.11
>
> nodetool proxyhistograms:
> - Read latency P99: ~8 ms
> - Range latency P99: ~183 ms
> - Maximum range latency: ~263 ms
>
> nodetool tablestats:
> - Local read latency: ~0.150 ms
> - Local write latency: ~0.015 ms
> - Bloom filter false ratio: 0.00000
> - SSTable count: 3
>
> Additional observations
>
> - The issue does not appear related to table size, compaction, SSTables, or
> tombstones.
> - Local read performance remains excellent.
> - The regression is observed specifically after upgrading to 4.1.11.
> - The issue disappears immediately after rolling back to 4.1.3.
>
> Performance comparison
>
> We have collected performance measurements and monitoring graphs for both
> versions under comparable production workloads.
>
> Observations:
> - 4.1.3: stable baseline, no slow query alerts
> - 4.1.11: significant increase in token range scan latency and slow query
> logs
> - rollback to 4.1.3: performance returns to baseline
> - re-upgrade to 4.1.11: degradation reproduced
>
> We also have monitoring graphs showing:
> - latency (P50 / P95 / P99)
> - rate of slow queries
> - throughput (ops/sec)
> - clear correlation with Cassandra version changes
>
> The degradation is reproducible and follows the Cassandra version changes
> exactly. Returning to 4.1.3 restores the original performance without any
> other infrastructure, application, schema, or configuration changes.
>
> Expected behavior:
>
> Performance of read queries should remain consistent between Cassandra 4.1.3
> and 4.1.11 for identical workload and cluster topology.
>
> Questions
>
> 1. Are there known changes between 4.1.3 and 4.1.11 affecting:
> - token range scan execution
> - cross-node read coordination
> - QUORUM read behavior in multi-DC setups
> - speculative retry interactions with range reads
>
> 2. Is this a known regression?
>
> 3. Are there specific diagnostics (tracing, JMX metrics) that could help
> identify the root cause?
>
> We are trying to determine whether the observed behavior is:
> - an unintended regression introduced in 4.1.11, or
> - a deliberate change in behavior (performance vs correctness trade-off)
>
> If this is an intentional evolution, could you confirm:
> - whether it is expected to remain in future 4.1.x releases
> - whether the same behavior is present in Cassandra 5.0
>
> This information is important for us to decide whether we need to adapt our
> architecture or whether this should be considered a regression.
>
> We can provide:
> - full logs (before/after upgrade)
> - tracing outputs
> - performance graphs
> - cluster configuration snapshots
--
This message was sent by Atlassian Jira
(v8.20.10#820010)