[
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684999#comment-17684999
]
Dong Lin commented on FLINK-29825:
----------------------------------
[~pnowojski] I think one drawback of your proposal is that it compares two
distributions and therefore depends on having a large enough number of samples
in both distributions. It means that if a regression happens, you need to run
enough commit-points before the recent distribution becomes considerably
different from the previous distribution according to the Kolmogorov-Smirnov
test. This would considerably delay the time-to-regression-detection. It seems
that my proposal would not suffer from this issue, since it allows users to
specify how many commit-points the regression needs to repeat before sending an
alert, and this number can be as low as 1-3 commit-points.
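To make the idea concrete, here is a minimal sketch of the rule I have in mind
(the function name and both parameters are just illustrative, not part of any
existing script):
{code:python}
# Illustrative sketch: alert only after the regression repeats for N consecutive
# commit-points. Both parameters are tunable assumptions.
ALERT_AFTER_N_POINTS = 3          # number of consecutive regressing commit-points
REGRESSION_THRESHOLD = 0.95       # score must drop below 95% of the baseline

def should_alert(baseline_score, recent_scores):
    """recent_scores: benchmark scores of the most recent commit-points, oldest first."""
    last_points = recent_scores[-ALERT_AFTER_N_POINTS:]
    if len(last_points) < ALERT_AFTER_N_POINTS:
        return False
    return all(score < baseline_score * REGRESSION_THRESHOLD for score in last_points)
{code}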
Regarding the drawback of not detecting "there is visible performance
regression within benchmark noise", my proposal is to either exclude noisy
benchmarks completely, or to require the regression to be 2X the noise (the
ratio is also tunable). This sounds like a reasonable practical solution,
right?
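As a sketch of the noise-based variant (again just illustrative; I am assuming
here that the noise is estimated as the standard deviation of the baseline
samples):
{code:python}
from statistics import mean, stdev

NOISE_MULTIPLIER = 2.0  # tunable: the drop must exceed 2X the benchmark noise

def is_significant_regression(baseline_samples, recent_score):
    """Hypothetical check: only flag drops larger than NOISE_MULTIPLIER * noise."""
    baseline_mean = mean(baseline_samples)
    noise = stdev(baseline_samples)
    return (baseline_mean - recent_score) > NOISE_MULTIPLIER * noise
{code}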
BTW, I don't think we will be able to have perfect regression detection without
any drawbacks (e.g. 0 false positives and 0 false negatives). The question is
whether the proposed solution can be useful enough (i.e. low false positive and
false negative rates) and whether it is the best solution among all available
choices. So it can be OK if some regressions are not detected, like the one
mentioned above.
BTW, regarding the noisy benchmark mentioned above, I am curious how the
Kolmogorov-Smirnov test can address this issue. Maybe I can update my proposal
to re-use the idea. Could you help explain it?
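For reference, my current understanding of the KS-based comparison is roughly
the following (a minimal sketch assuming scipy is available; the p-value
threshold and the extra mean check are my own assumptions, not your proposal):
{code:python}
from statistics import mean
from scipy.stats import ks_2samp

P_VALUE_THRESHOLD = 0.05  # assumption: treat the distributions as different below this

def ks_regression_detected(previous_samples, recent_samples):
    """Two-sample KS test between previous and recent benchmark score samples."""
    statistic, p_value = ks_2samp(previous_samples, recent_samples)
    # A small p-value only says the distributions differ; also check that the
    # recent samples are actually worse, not better.
    return p_value < P_VALUE_THRESHOLD and mean(recent_samples) < mean(previous_samples)
{code}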
I will take a look at the tooling mentioned above later to see if we can learn
from it or re-use it.
> Improve benchmark stability
> ---------------------------
>
> Key: FLINK-29825
> URL: https://issues.apache.org/jira/browse/FLINK-29825
> Project: Flink
> Issue Type: Improvement
> Components: Benchmarks
> Affects Versions: 1.17.0
> Reporter: Yanfei Lei
> Assignee: Yanfei Lei
> Priority: Minor
>
> Currently, regressions are detected by a simple script which may produce false
> positives and false negatives, especially for benchmarks with small absolute
> values, where small value changes cause large percentage changes. See
> [here|https://github.com/apache/flink-benchmarks/blob/master/regression_report.py#L132-L136]
> for details.
> And since all benchmarks are executed on one physical machine, it can happen
> that hardware issues affect performance, as in "[FLINK-18614] Performance
> regression 2020.07.13".
>
> This ticket aims to improve the precision and recall of the regression-check
> script.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)