[
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684677#comment-17684677
]
Piotr Nowojski edited comment on FLINK-29825 at 2/6/23 1:29 PM:
----------------------------------------------------------------
I have responded on the dev mailing list, but let's maybe move the discussion
here.
[~lindong], the Kolmogorov-Smirnov test was just the result of a quick Google
search for relevant mathematical concepts. I have a feeling it could be adapted
into something that would work for us. For example, instead of checking the
supremum of the difference between two empirical distribution functions (EDFs),
we could sum the differences between them. If the new EDF has lower values, the
sum of differences would be negative, which would point toward a regression.
But maybe there are better approaches.
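A minimal sketch of that signed-difference idea, in pure Python (the helper names are mine, not anything the ticket already implements; the sign convention simply reports which way the distribution shifted):

```python
from bisect import bisect_right

def edf(sorted_sample, x):
    """Empirical distribution function: fraction of sample values <= x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)

def signed_edf_diff(old, new):
    """Sum of (EDF_new - EDF_old) over all pooled observation points.

    Unlike the Kolmogorov-Smirnov supremum, this keeps the sign: a
    positive sum means the new results are shifted toward lower values,
    a negative sum toward higher values. How that maps to "regression"
    depends on whether lower benchmark scores are better or worse.
    """
    old, new = sorted(old), sorted(new)
    points = sorted(old + new)
    return sum(edf(new, x) - edf(old, x) for x in points)
```

For example, `signed_edf_diff([10, 11, 12], [1, 2, 3])` is positive (the new sample moved down), and it is zero when both samples are identical.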
I think the drawback of your proposal is that it wouldn't detect a performance
regression that is hidden within the benchmark noise. Detecting it should be
doable with a large enough number of samples (as for example described above):
say the results oscillate randomly around 1000 (+/- 150), and a performance
regression shifts them to 900 (+/- 135).
And we have quite a lot of noisy benchmarks, like
[this|http://codespeed.dak8s.net:8000/timeline/?ben=fireProcessingTimers&env=2]
or [this|http://codespeed.dak8s.net:8000/timeline/?ben=serializerTuple&env=2].
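To illustrate why such a shift becomes detectable with enough samples, here is a rough sketch; the `detectable` helper, the simulated uniform noise, and the two-standard-error threshold are all illustrative assumptions, not an agreed-upon check (a real implementation would use something like Welch's t-test):

```python
import math
import random
import statistics

def detectable(old, new, z=2.0):
    """Crude two-sample check: is the difference of sample means larger
    than z standard errors of that difference?"""
    se = math.sqrt(statistics.variance(old) / len(old)
                   + statistics.variance(new) / len(new))
    return abs(statistics.mean(new) - statistics.mean(old)) > z * se

# Simulated noisy benchmark: old results around 1000 (+/- 150),
# new results around 900 (+/- 135), as in the example above.
random.seed(42)
old = [random.uniform(850, 1150) for _ in range(100)]
new = [random.uniform(765, 1035) for _ in range(100)]
```

With 100 samples per side, the standard error of the mean difference is roughly 12, so the 100-point shift stands far outside the noise even though the individual runs overlap heavily.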
I was also informed about some tooling built specifically for detecting
performance regressions from benchmark results:
> fork of Hunter - a perf change detection tool, originally from DataStax:
> Blog post -
> [https://medium.com/building-the-open-data-stack/detecting-performance-regressions-with-datastax-hunter-c22dc444aea4]
> Paper - [https://arxiv.org/pdf/2301.03034.pdf]
> Our fork - [https://github.com/gerrrr/hunter]
The algorithm used underneath, "E-divisive Means", sounds promising.
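For intuition, change-point detection of this family looks for the index at which a series is best split into segments with different means. Below is a drastically simplified sketch of that idea (the real E-divisive means algorithm uses energy statistics and permutation tests for significance; this toy version just minimizes within-segment squared error):

```python
def change_point(series):
    """Return the split index that best divides `series` into two
    segments with different means, i.e. the split minimizing the total
    within-segment squared error. A toy illustration of change-point
    detection, not the actual E-divisive means algorithm."""
    def sse(seg):
        m = sum(seg) / len(seg)
        return sum((x - m) ** 2 for x in seg)
    return min(range(1, len(series)),
               key=lambda i: sse(series[:i]) + sse(series[i:]))
```

On a series that jumps from a stable 1000 to a stable 900, this recovers the index of the jump exactly; on noisy data it would need the significance testing that Hunter adds on top.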
> Improve benchmark stability
> ---------------------------
>
> Key: FLINK-29825
> URL: https://issues.apache.org/jira/browse/FLINK-29825
> Project: Flink
> Issue Type: Improvement
> Components: Benchmarks
> Affects Versions: 1.17.0
> Reporter: Yanfei Lei
> Assignee: Yanfei Lei
> Priority: Minor
>
> Currently, regressions are detected by a simple script which may produce false
> positives and false negatives, especially for benchmarks with small absolute
> values, where small value changes cause large percentage changes. See
> [here|https://github.com/apache/flink-benchmarks/blob/master/regression_report.py#L132-L136]
> for details.
> Also, all benchmarks are executed on one physical machine, so hardware issues
> can affect performance, as in "[FLINK-18614] Performance
> regression 2020.07.13".
>
> This ticket aims to improve the precision and recall of the regression-check
> script.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)