[jira] [Updated] (CASSANDRA-19365) invalid EstimatedHistogramReservoirSnapshot::getValue values due to race condition in DecayingEstimatedHistogramReservoir

Maxim Muzafarov (Jira) Thu, 12 Sep 2024 11:11:12 -0700


     [ https://issues.apache.org/jira/browse/CASSANDRA-19365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Maxim Muzafarov updated CASSANDRA-19365:
----------------------------------------
    Test and Documentation Plan: Documentation is not required. Run tests to 
verify the patch.  (was: unit test )
                         Status: Patch Available  (was: In Progress)

Changes are ready for review. I've added benchmarks and improved 
consistency so that we no longer lose any updates, as mentioned previously. 
The corresponding Javadoc has also been updated to reflect that no locking is 
used for the {{decayingBuckets}} reset.
I'll prepare CI shortly.



> This lets us keep updates non-synchronized at the price of letting some 
> updates be missed during rescale.

That's no longer relevant: with this patch, updates are not lost during rescale.



Benchmarks:
{code:java}
// The "(landmarkResetIntervalNs)" parameter is LANDMARK_RESET_INTERVAL_IN_NS, in nanoseconds.

cassandra-19365

Benchmark                               (landmarkResetIntervalNs)   Mode  Cnt      Score     Error   Units
DecayingEstimatedHistogramBench.update                     100000  thrpt   12  14995,732 ± 815,913  ops/ms
DecayingEstimatedHistogramBench.update                     500000  thrpt   12  14290,593 ± 669,975  ops/ms
DecayingEstimatedHistogramBench.update                    1000000  thrpt   12  14648,427 ± 800,957  ops/ms

trunk

Benchmark                               (landmarkResetIntervalNs)   Mode  Cnt      Score      Error   Units
DecayingEstimatedHistogramBench.update                     100000  thrpt   12  14236,466 ± 1203,280  ops/ms
DecayingEstimatedHistogramBench.update                     500000  thrpt   12  13746,524 ± 1908,030  ops/ms
DecayingEstimatedHistogramBench.update                    1000000  thrpt   12  14048,394 ±  676,323  ops/ms
{code}

> invalid EstimatedHistogramReservoirSnapshot::getValue values due to race 
> condition in DecayingEstimatedHistogramReservoir
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19365
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19365
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Observability/Metrics
>            Reporter: Jakub Zytka
>            Assignee: Maxim Muzafarov
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> `DecayingEstimatedHistogramReservoir` has a race condition between `update` 
> and `rescaleIfNeeded`.
> A sample which ends up (`update`) in an already scaled decayingBucket 
> (`rescaleIfNeeded`) may still use a non-scaled weight because `decayLandmark` 
> has not been updated yet at the moment of `update`.
>  
> The observed consequence was flooding of the cluster with speculative retries 
> (we happened to hit low-percentile buckets with overweight samples, which 
> drove p99 below true p50 for a long time).
> Please note that despite the manifestation being similar to CASSANDRA-19330, 
> these are two distinct bugs in their own right.
> This bug affects versions 4.0+.
> On 3.11 there's locking in DEHR. I did not check earlier versions.
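
To make the interleaving described in the report concrete, here is a minimal single-threaded sketch that replays it step by step. It is a simplified model and not the actual Cassandra code: the class name, the method names, the single bucket, and the 60-second mean lifetime are all assumptions chosen for illustration. It only shows how a weight computed against the old landmark becomes hugely overweight once it lands in a bucket that has already been rescaled.

```java
// Hypothetical, simplified model of a forward-decay histogram with one bucket.
// None of these names are taken from the Cassandra code base.
final class DecayRaceSketch {
    // Assumed mean lifetime of a sample, in seconds (illustrative value).
    static final double MEAN_LIFETIME_SECONDS = 60.0;

    // Forward-decay weight of a sample recorded at nowSeconds, relative to the landmark.
    static double weight(long nowSeconds, long landmarkSeconds) {
        return Math.exp((nowSeconds - landmarkSeconds) / MEAN_LIFETIME_SECONDS);
    }

    // Replays 100 ordinary updates plus one more update around a rescale and
    // returns how many samples the bucket appears to hold afterwards.
    // With racy == true, the last update computes its weight against the old
    // landmark but is added only after the bucket has been rescaled.
    static double effectiveCount(long oldLandmark, long now, boolean racy) {
        double bucket = 0.0;
        for (int i = 0; i < 100; i++) {
            bucket += weight(now, oldLandmark);
        }

        // "Thread A" reads the old landmark and computes its weight.
        double pendingWeight = weight(now, oldLandmark);
        if (!racy) {
            bucket += pendingWeight; // added before the rescale: consistent
        }

        // "Thread B" rescales the bucket by the same factor.
        bucket /= weight(now, oldLandmark);

        if (racy) {
            bucket += pendingWeight; // added after the rescale: stale, unscaled weight
        }

        // Only now is the new landmark published.
        long newLandmark = now;
        return bucket / weight(now, newLandmark);
    }

    public static void main(String[] args) {
        System.out.println("consistent: " + effectiveCount(0, 1800, false));
        System.out.println("racy:       " + effectiveCount(0, 1800, true));
    }
}
```

In the consistent ordering the bucket holds roughly 101 samples' worth of weight. In the racy ordering the single late sample outweighs the other 100 by many orders of magnitude, which matches the overweight samples in low-percentile buckets reported above.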



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
