[
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684677#comment-17684677
]
Piotr Nowojski edited comment on FLINK-29825 at 2/6/23 1:29 PM:
----------------------------------------------------------------
I have responded on the dev mailing list, but let's maybe move the discussion
here.
[~lindong], the Kolmogorov-Smirnov test was just the result of a quick Google
search for relevant mathematical concepts. I have a feeling it could be adapted
into something that would work for us. For example, instead of checking the
supremum of the difference between two empirical distribution functions (EDFs),
we could sum the differences between them. If the new EDF has lower values, the
sum of differences would be negative, which would point toward a regression.
But maybe there are better approaches.
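A minimal sketch of that signed-difference idea, in pure Python (the helper names are mine, not anything the ticket already implements; the sign convention simply reports which way the distribution shifted):

```python
from bisect import bisect_right

def edf(sorted_sample, x):
    """Empirical distribution function: fraction of sample values <= x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)

def signed_edf_diff(old, new):
    """Sum of (EDF_new - EDF_old) over all pooled observation points.

    Unlike the Kolmogorov-Smirnov supremum, this keeps the sign: a
    positive sum means the new results are shifted toward lower values,
    a negative sum toward higher values. How that maps to "regression"
    depends on whether lower benchmark scores are better or worse.
    """
    old, new = sorted(old), sorted(new)
    points = sorted(old + new)
    return sum(edf(new, x) - edf(old, x) for x in points)
```

For example, `signed_edf_diff([10, 11, 12], [1, 2, 3])` is positive (the new sample moved down), and it is zero when both samples are identical.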
I think the drawback of your proposal is that it wouldn't detect a performance
regression that is hidden within the benchmark noise. Detecting it should be
doable with a large enough number of samples (as for example described above):
say the results oscillate randomly around 1000 (+/- 150), and a performance
regression shifts them to 900 (+/- 135).
And we have quite a lot of noisy benchmarks, like
[this|http://codespeed.dak8s.net:8000/timeline/?ben=fireProcessingTimers&env=2]
or [this|http://codespeed.dak8s.net:8000/timeline/?ben=serializerTuple&env=2].
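To illustrate why such a shift becomes detectable with enough samples, here is a rough sketch; the `detectable` helper, the simulated uniform noise, and the two-standard-error threshold are all illustrative assumptions, not an agreed-upon check (a real implementation would use something like Welch's t-test):

```python
import math
import random
import statistics

def detectable(old, new, z=2.0):
    """Crude two-sample check: is the difference of sample means larger
    than z standard errors of that difference?"""
    se = math.sqrt(statistics.variance(old) / len(old)
                   + statistics.variance(new) / len(new))
    return abs(statistics.mean(new) - statistics.mean(old)) > z * se

# Simulated noisy benchmark: old results around 1000 (+/- 150),
# new results around 900 (+/- 135), as in the example above.
random.seed(42)
old = [random.uniform(850, 1150) for _ in range(100)]
new = [random.uniform(765, 1035) for _ in range(100)]
```

With 100 samples per side, the standard error of the mean difference is roughly 12, so the 100-point shift stands far outside the noise even though the individual runs overlap heavily.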
I was also informed about some tooling built specifically for detecting
performance regressions from benchmark results:
> fork of Hunter - a perf change detection tool, originally from DataStax:
> Blog post -
> [https://medium.com/building-the-open-data-stack/detecting-performance-regressions-with-datastax-hunter-c22dc444aea4]
> Paper - [https://arxiv.org/pdf/2301.03034.pdf]
> Our fork - [https://github.com/gerrrr/hunter]
The algorithm used underneath, "E-divisive Means", sounds promising.
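For intuition, change-point detection of this family looks for the index at which a series is best split into segments with different means. Below is a drastically simplified sketch of that idea (the real E-divisive means algorithm uses energy statistics and permutation tests for significance; this toy version just minimizes within-segment squared error):

```python
def change_point(series):
    """Return the split index that best divides `series` into two
    segments with different means, i.e. the split minimizing the total
    within-segment squared error. A toy illustration of change-point
    detection, not the actual E-divisive means algorithm."""
    def sse(seg):
        m = sum(seg) / len(seg)
        return sum((x - m) ** 2 for x in seg)
    return min(range(1, len(series)),
               key=lambda i: sse(series[:i]) + sse(series[i:]))
```

On a series that jumps from a stable 1000 to a stable 900, this recovers the index of the jump exactly; on noisy data it would need the significance testing that Hunter adds on top.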
> Improve benchmark stability
> ---------------------------
>
> Key: FLINK-29825
> URL: https://issues.apache.org/jira/browse/FLINK-29825
> Project: Flink
> Issue Type: Improvement
> Components: Benchmarks
> Affects Versions: 1.17.0
> Reporter: Yanfei Lei
> Assignee: Yanfei Lei
> Priority: Minor
>
> Currently, regressions are detected by a simple script which may produce false
> positives and false negatives, especially for benchmarks with small absolute
> values, where small value changes cause large percentage changes. See
> [here|https://github.com/apache/flink-benchmarks/blob/master/regression_report.py#L132-L136]
> for details.
> Also, all benchmarks are executed on one physical machine, so hardware issues
> can affect performance, as in "[FLINK-18614] Performance
> regression 2020.07.13".
>
> This ticket aims to improve the precision and recall of the regression-check
> script.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)