[
https://issues.apache.org/jira/browse/CASSANDRA-19365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Maxim Muzafarov updated CASSANDRA-19365:
----------------------------------------
Test and Documentation Plan: Documentation is not required. Run tests to
verify the patch. (was: unit test )
Status: Patch Available (was: In Progress)
The changes are ready for review. I've added benchmarks and improved
consistency, so updates are no longer lost during rescale, as was previously
mentioned. The corresponding Javadoc has also been updated to reflect that no
locking is used for the {{decayingBuckets}} reset.
I'll prepare a CI run shortly.
> This lets us keep updates non-synchronized at the price of letting some
> updates be missed during rescale.
That statement is no longer relevant.
Benchmarks:
{code:java}
// The "(landmarkResetIntervalNs)" parameter is LANDMARK_RESET_INTERVAL_IN_NS, in nanoseconds.

cassandra-19365
Benchmark                               (landmarkResetIntervalNs)   Mode  Cnt      Score      Error   Units
DecayingEstimatedHistogramBench.update                     100000  thrpt   12  14995,732 ±  815,913  ops/ms
DecayingEstimatedHistogramBench.update                     500000  thrpt   12  14290,593 ±  669,975  ops/ms
DecayingEstimatedHistogramBench.update                    1000000  thrpt   12  14648,427 ±  800,957  ops/ms

trunk
Benchmark                               (landmarkResetIntervalNs)   Mode  Cnt      Score      Error   Units
DecayingEstimatedHistogramBench.update                     100000  thrpt   12  14236,466 ± 1203,280  ops/ms
DecayingEstimatedHistogramBench.update                     500000  thrpt   12  13746,524 ± 1908,030  ops/ms
DecayingEstimatedHistogramBench.update                    1000000  thrpt   12  14048,394 ±  676,323  ops/ms
{code}
> invalid EstimatedHistogramReservoirSnapshot::getValue values due to race
> condition in DecayingEstimatedHistogramReservoir
> -------------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-19365
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19365
> Project: Cassandra
> Issue Type: Bug
> Components: Observability/Metrics
> Reporter: Jakub Zytka
> Assignee: Maxim Muzafarov
> Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
> Time Spent: 3.5h
> Remaining Estimate: 0h
>
> `DecayingEstimatedHistogramReservoir` has a race condition between `update`
> and `rescaleIfNeeded`.
> A sample which ends up (`update`) in an already scaled decayingBucket
> (`rescaleIfNeeded`) may still use a non-scaled weight because `decayLandmark`
> has not been updated yet at the moment of `update`.
>
> The observed consequence was flooding of the cluster with speculative retries
> (we happened to hit low-percentile buckets with overweight samples, which
> drove p99 below true p50 for a long time).
> Please note that despite the manifestation being similar to CASSANDRA-19330,
> these are two distinct bugs in their own right.
> This bug affects versions 4.0+.
> On 3.11 there's locking in DEHR. I did not check earlier versions.
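To make the race in the description above concrete, here is a minimal, single-threaded sketch that replays the problematic interleaving deterministically. It is not the Cassandra code: the class name, the {{simulate}} method, the two-bucket array and the 60-second mean lifetime are all hypothetical, and the weight formula is only assumed to be a plain forward-decay exponential. It shows that a sample whose weight was computed against the old landmark, but which lands in a bucket that has already been rescaled, ends up many orders of magnitude heavier than a correctly weighted one.

```java
public class StaleLandmarkSketch {
    // Hypothetical mean lifetime, chosen only to make the effect visible.
    static final double MEAN_LIFETIME_S = 60.0;

    // Assumed forward-decay weight: grows exponentially with time since the landmark.
    public static double forwardDecayWeight(long nowS, long landmarkS) {
        return Math.exp((nowS - landmarkS) / MEAN_LIFETIME_S);
    }

    /**
     * Replays one update racing with one rescale.
     * Returns {bucket 0, bucket 1} after the interleaving.
     * When staleLandmark is true, the late update uses the weight it computed
     * before the rescale published the new landmark.
     */
    public static double[] simulate(boolean staleLandmark) {
        long landmark = 0;
        long now = 1800; // 30 minutes after the landmark
        double[] buckets = new double[2];

        // 100 samples land in bucket 0 before the rescale starts.
        for (int i = 0; i < 100; i++) {
            buckets[0] += forwardDecayWeight(now, landmark);
        }

        // The racing updater reads the old landmark and computes its weight.
        double staleWeight = forwardDecayWeight(now, landmark);

        // Rescale: divide every bucket by the current weight, then move the landmark.
        double factor = forwardDecayWeight(now, landmark);
        for (int i = 0; i < buckets.length; i++) {
            buckets[i] /= factor;
        }
        landmark = now;

        // The late update is applied to an already rescaled bucket.
        double freshWeight = forwardDecayWeight(now, landmark);
        buckets[1] += staleLandmark ? staleWeight : freshWeight;
        return buckets;
    }

    public static void main(String[] args) {
        double[] ok = simulate(false);
        double[] racy = simulate(true);
        System.out.printf("consistent: bucket0=%.1f bucket1=%.1f%n", ok[0], ok[1]);
        System.out.printf("racy:       bucket0=%.1f bucket1=%.3e%n", racy[0], racy[1]);
    }
}
```

In the racy run, the single late sample in bucket 1 outweighs the 100 samples in bucket 0 by a factor on the order of 1e11, which is consistent with the reported symptom of percentiles being dragged toward whichever bucket received the overweight sample.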