[
https://issues.apache.org/jira/browse/CASSANDRA-19365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817370#comment-17817370
]
Jakub Zytka edited comment on CASSANDRA-19365 at 2/14/24 1:02 PM:
------------------------------------------------------------------
[https://github.com/apache/cassandra/pull/3102/files]
The proposed PR keeps the changes close to the minimum necessary, so that it
stays consistent with the current state of DEHR
(`DecayingEstimatedHistogramReservoir`).
To preserve update performance, the solution does not introduce synchronization
between updates and rescales. Instead, it changes the decay landmark and the
decaying buckets together, atomically. This keeps updates non-synchronized at
the price of letting some updates be missed during a rescale. It also ensures
that a snapshot never captures a half-rescaled state.
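
One way to picture the atomic change of landmark and buckets is the minimal
sketch below. It is not the code from the PR: the class `AtomicRescaleSketch`,
the `State` holder, and all method names are hypothetical, and the 60-second
mean lifetime is an assumed constant. The idea shown is only that the landmark
and the buckets live in one immutable holder that is swapped with a single
reference write, so a reader always sees a matching pair.

```java
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch, not the actual Cassandra implementation.
class AtomicRescaleSketch {
    // Immutable pair: a landmark and the buckets that were weighted against it.
    static final class State {
        final long decayLandmarkMs;
        final AtomicLongArray decayingBuckets;

        State(long decayLandmarkMs, AtomicLongArray decayingBuckets) {
            this.decayLandmarkMs = decayLandmarkMs;
            this.decayingBuckets = decayingBuckets;
        }
    }

    static final double MEAN_LIFETIME_S = 60.0; // assumed value

    private final AtomicReference<State> state;

    AtomicRescaleSketch(int bucketCount, long nowMs) {
        state = new AtomicReference<>(new State(nowMs, new AtomicLongArray(bucketCount)));
    }

    // Forward-decay weight of a sample taken at nowMs relative to a landmark.
    static double weight(long nowMs, long landmarkMs) {
        return Math.exp((nowMs - landmarkMs) / 1000.0 / MEAN_LIFETIME_S);
    }

    // No locking: one read of the reference yields a consistent landmark/bucket pair.
    void update(int bucket, long nowMs) {
        State s = state.get();
        s.decayingBuckets.addAndGet(bucket, Math.round(weight(nowMs, s.decayLandmarkMs)));
    }

    // Build scaled buckets off to the side, then publish them with the new landmark.
    // Updates that reach the old buckets after they were copied are lost.
    void rescale(long nowMs) {
        State old = state.get();
        double factor = weight(nowMs, old.decayLandmarkMs);
        AtomicLongArray scaled = new AtomicLongArray(old.decayingBuckets.length());
        for (int i = 0; i < scaled.length(); i++) {
            scaled.set(i, Math.round(old.decayingBuckets.get(i) / factor));
        }
        state.compareAndSet(old, new State(nowMs, scaled));
    }

    // A snapshot copies buckets from a single State, so it is never half-rescaled.
    long[] snapshot() {
        State s = state.get();
        long[] copy = new long[s.decayingBuckets.length()];
        for (int i = 0; i < copy.length; i++) {
            copy[i] = s.decayingBuckets.get(i);
        }
        return copy;
    }
}
```

In this sketch the trade-off described above is visible in `rescale`: an update
that writes to the old buckets after the copy loop has passed its bucket is
dropped, but no update can ever combine an old landmark with new buckets.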
> invalid EstimatedHistogramReservoirSnapshot::getValue values due to race
> condition in DecayingEstimatedHistogramReservoir
> -------------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-19365
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19365
> Project: Cassandra
> Issue Type: Bug
> Components: Observability/Metrics
> Reporter: Jakub Zytka
> Assignee: Jakub Zytka
> Priority: Normal
> Time Spent: 20m
> Remaining Estimate: 0h
>
> `DecayingEstimatedHistogramReservoir` has a race condition between `update`
> and `rescaleIfNeeded`.
> A sample that `update` adds to a decayingBucket which `rescaleIfNeeded` has
> already scaled may still use a non-scaled weight, because `decayLandmark`
> has not yet been updated at the moment of `update`.
>
> The observed consequence was flooding of the cluster with speculative retries
> (we happened to hit low-percentile buckets with overweight samples, which
> drove p99 below true p50 for a long time).
> Please note that despite the manifestation being similar to CASSANDRA-19330,
> these are two distinct bugs in their own right.
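
To get a feel for how overweight such a sample can be, the race described in
the issue can be put into numbers. The sketch below is illustrative only: the
class `StaleLandmarkDemo` and its methods are hypothetical, and both the
60-second mean lifetime and the 30-minute gap between rescales are assumed
values rather than figures taken from the issue. Under those assumptions, a
sample weighted against the stale landmark right after the buckets were scaled
counts roughly e^30 times (on the order of 10^13) more than a correctly
weighted one.

```java
// Hypothetical illustration of the stale-landmark weight, not Cassandra code.
class StaleLandmarkDemo {
    static final double MEAN_LIFETIME_S = 60.0; // assumed value

    // Forward-decay weight of a sample taken at nowMs relative to a landmark.
    static double weight(long nowMs, long landmarkMs) {
        return Math.exp((nowMs - landmarkMs) / 1000.0 / MEAN_LIFETIME_S);
    }

    // How many times heavier a sample is when it is weighted against the stale
    // landmark instead of the fresh one that matches the already scaled buckets.
    static double overweightFactor(long nowMs, long staleLandmarkMs, long freshLandmarkMs) {
        return weight(nowMs, staleLandmarkMs) / weight(nowMs, freshLandmarkMs);
    }

    public static void main(String[] args) {
        long staleLandmarkMs = 0L;
        long rescaleAtMs = 30L * 60L * 1000L; // assumed 30 minutes between rescales

        // Buckets are already scaled for rescaleAtMs, but the sample still
        // sees the old landmark.
        double factor = overweightFactor(rescaleAtMs, staleLandmarkMs, rescaleAtMs);
        System.out.println("overweight factor: " + factor);
    }
}
```

A single sample with that much weight can outweigh everything else in the
histogram, which fits the reported symptom of percentiles collapsing toward
whichever bucket the sample landed in.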