[
https://issues.apache.org/jira/browse/CASSANDRA-19365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jakub Zytka updated CASSANDRA-19365:
------------------------------------
Description:
`DecayingEstimatedHistogramReservoir` has a race condition between `update` and
`rescaleIfNeeded`.
A sample that `update` adds to a decayingBucket which `rescaleIfNeeded` has
already rescaled may still carry a non-rescaled weight, because `decayLandmark`
had not yet been updated when `update` computed that weight.
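
The interleaving can be replayed deterministically in a single thread. What
follows is a minimal sketch, not the Cassandra source: the class `RaceSketch`,
its methods, and the constants are hypothetical. It only assumes a forward-decay
scheme in which a sample's weight is exp((now - landmark) / meanLifetime) and a
rescale divides every bucket by the weight at the new landmark before
publishing that landmark.

```java
class RaceSketch {
    // Hypothetical value, chosen only to keep the numbers readable.
    static final double MEAN_LIFETIME_S = 60.0;

    final double[] buckets = new double[2];
    long landmarkS = 0;

    // Forward-decay weight of a sample taken at nowS relative to a landmark.
    static double weight(long nowS, long landmarkS) {
        return Math.exp((nowS - landmarkS) / MEAN_LIFETIME_S);
    }

    // Adds one sample, weighted against whatever landmark is visible right now.
    void update(int bucket, long nowS) {
        buckets[bucket] += weight(nowS, landmarkS);
    }

    // Rescale, step 1: divide all buckets by the weight at the new landmark.
    void scaleBuckets(long nowS) {
        double factor = weight(nowS, landmarkS);
        for (int i = 0; i < buckets.length; i++) {
            buckets[i] /= factor;
        }
    }

    // Rescale, step 2: make the new landmark visible to updaters.
    void publishLandmark(long nowS) {
        landmarkS = nowS;
    }

    // Replays one rescale with an update landing either between the two
    // rescale steps (racy) or after both of them (not racy).
    static double[] replay(boolean racy) {
        RaceSketch r = new RaceSketch();
        long now = 1800;              // 30 minutes after the old landmark
        r.update(1, now);             // an ordinary sample before the rescale
        r.scaleBuckets(now);
        if (racy) {
            r.update(0, now);         // still sees the stale landmark
            r.publishLandmark(now);
        } else {
            r.publishLandmark(now);
            r.update(0, now);         // sees the new landmark
        }
        return r.buckets;
    }

    public static void main(String[] args) {
        System.out.println("no race: " + java.util.Arrays.toString(replay(false)));
        System.out.println("race:    " + java.util.Arrays.toString(replay(true)));
    }
}
```

In the racy replay the late sample keeps a weight of exp(30) while every other
bucket has already been scaled down, so a single sample outweighs the rest of
the histogram by many orders of magnitude.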
The observed consequence was flooding of the cluster with speculative retries
(we happened to hit low-percentile buckets with overweight samples, which drove
p99 below true p50 for a long time).
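
To illustrate how one overweight sample can drag a high percentile down, here
is a small sketch of a weighted-bucket percentile lookup. It is a
simplification under stated assumptions, not the actual
`EstimatedHistogramReservoirSnapshot::getValue` code: the class
`PercentileSketch`, the bucket bounds (read them as milliseconds), and the
weights are all made up for the example.

```java
class PercentileSketch {
    // Returns the upper bound of the first bucket at which the cumulative
    // weight reaches the fraction q of the total weight.
    static long percentile(long[] upperBounds, double[] weights, double q) {
        double total = 0;
        for (double w : weights) {
            total += w;
        }
        double threshold = q * total;
        double cumulative = 0;
        for (int i = 0; i < weights.length; i++) {
            cumulative += weights[i];
            if (cumulative >= threshold) {
                return upperBounds[i];
            }
        }
        return upperBounds[upperBounds.length - 1];
    }

    public static void main(String[] args) {
        long[] bounds = {1, 2, 5, 10, 50};        // hypothetical bucket bounds
        double[] healthy = {10, 40, 30, 15, 5};   // hypothetical decayed weights

        // One sample with a stale, non-rescaled weight lands in the lowest bucket.
        double[] skewed = healthy.clone();
        skewed[0] += Math.exp(30.0);

        System.out.println("true p50:   " + percentile(bounds, healthy, 0.50));
        System.out.println("true p99:   " + percentile(bounds, healthy, 0.99));
        System.out.println("skewed p99: " + percentile(bounds, skewed, 0.99));
    }
}
```

With these made-up numbers the skewed p99 falls into the lowest bucket, below
the true p50, and it stays there until the oversized weight decays away or is
outweighed by newer samples.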
Please note that despite the manifestation being similar to CASSANDRA-19330,
these are two distinct bugs in their own right.
This bug affects versions 4.0+.
On 3.11 there is locking in DEHR (`DecayingEstimatedHistogramReservoir`). I did
not check earlier versions.
was:
`DecayingEstimatedHistogramReservoir` has a race condition between `update` and
`rescaleIfNeeded`.
A sample which ends up (`update`) in an already scaled decayingBucket
(`rescaleIfNeeded`) may still use a non-scaled weight because `decayLandmark`
has not been updated yet at the moment of `update`.
The observed consequence was flooding of the cluster with speculative retries
(we happened to hit low-percentile buckets with overweight samples, which drove
p99 below true p50 for a long time).
Please note that despite the manifestation being similar to CASSANDRA-19330,
these are two distinct bugs in their own right.
This bug affects
> invalid EstimatedHistogramReservoirSnapshot::getValue values due to race
> condition in DecayingEstimatedHistogramReservoir
> -------------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-19365
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19365
> Project: Cassandra
> Issue Type: Bug
> Components: Observability/Metrics
> Reporter: Jakub Zytka
> Assignee: Jakub Zytka
> Priority: Normal
> Fix For: 5.x
>
> Time Spent: 20m
> Remaining Estimate: 0h