[ 
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684999#comment-17684999
 ] 

Dong Lin commented on FLINK-29825:
----------------------------------

[~pnowojski] I think one drawback of your proposal is that it compares two 
distributions and therefore depends on having a large enough number of samples 
in both distributions. It means that if a regression happens, we need to run 
enough commit-points for the recent distribution to become considerably 
different from the previous distribution according to the Kolmogorov-Smirnov 
test. This would considerably delay the time-to-regression-detection. It seems 
that my proposal would not suffer from this issue, since it allows users to 
specify how many commit-points need to show the regression before an alert is 
sent, and this number can be as small as 1-3 commit-points.
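
To make it concrete, here is a rough sketch (in Python, with made-up names and 
thresholds, not the actual flink-benchmarks code) of the alerting rule I have 
in mind:

def should_alert(scores, baseline, regression_ratio=0.9, min_consecutive=2):
    # scores: per-commit benchmark scores (higher is better), newest last.
    # baseline: reference score, e.g. the mean of an earlier stable window.
    # Only alert once the last `min_consecutive` commit-points (tunable,
    # e.g. 1-3) all fall below regression_ratio * baseline.
    recent = scores[-min_consecutive:]
    return len(recent) == min_consecutive and all(
        s < regression_ratio * baseline for s in recent)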

Regarding the drawback of not detecting "there is visible performance 
regression within benchmark noise": my proposal is to either exclude the noisy 
benchmark completely, or require the regression to be 2X the noise (the ratio 
is also tunable). These sound like reasonable practical solutions, right?
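
For example, the 2X-noise rule could look roughly like this (again just a 
sketch; the noise estimate and the ratio are illustrative):

import statistics

def exceeds_noise(baseline_scores, current_score, noise_ratio=2.0):
    # Estimate noise as the standard deviation of the baseline window and
    # only treat the drop as a regression if it is larger than
    # noise_ratio * noise (the tunable 2X mentioned above).
    baseline_mean = statistics.mean(baseline_scores)
    noise = statistics.stdev(baseline_scores)
    return (baseline_mean - current_score) > noise_ratio * noise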

BTW, I don't think we will be able to have perfect regression detection without 
any drawbacks (e.g. 0 false positives and 0 false negatives). The question is 
whether the proposed solution is useful enough (i.e. low false positives and 
low false negatives) and whether it is the best solution among all available 
choices. So it can be OK if some regressions are not detected, like the one 
mentioned above.


BTW, regarding the noisy benchmark mentioned above, I am curious how the 
Kolmogorov-Smirnov test can address this issue. Maybe I can update my proposal 
to re-use the idea. Can you help explain it?
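
For reference, this is my (possibly incorrect) understanding of the 
distribution-comparison check, sketched with scipy's two-sample K-S test; the 
window contents and the p-value threshold are only placeholders:

from scipy import stats

def ks_regressed(previous_scores, recent_scores, alpha=0.01):
    # Two-sample Kolmogorov-Smirnov test: are the two score distributions
    # significantly different? Also require the recent mean to be lower,
    # so that improvements are not flagged as regressions.
    statistic, p_value = stats.ks_2samp(previous_scores, recent_scores)
    recent_mean = sum(recent_scores) / len(recent_scores)
    previous_mean = sum(previous_scores) / len(previous_scores)
    return p_value < alpha and recent_mean < previous_mean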

I will take a look at the tooling mentioned above later to see if we can learn 
from it or re-use it.

> Improve benchmark stability
> ---------------------------
>
>                 Key: FLINK-29825
>                 URL: https://issues.apache.org/jira/browse/FLINK-29825
>             Project: Flink
>          Issue Type: Improvement
>          Components: Benchmarks
>    Affects Versions: 1.17.0
>            Reporter: Yanfei Lei
>            Assignee: Yanfei Lei
>            Priority: Minor
>
> Currently, regressions are detected by a simple script which may have false 
> positives and false negatives, especially for benchmarks with small absolute 
> values, where small value changes cause large percentage changes. See 
> [here|https://github.com/apache/flink-benchmarks/blob/master/regression_report.py#L132-L136]
>  for details.
> And since all benchmarks are executed on one physical machine, hardware 
> issues may affect performance, like "[FLINK-18614] Performance 
> regression 2020.07.13".
>  
> This ticket aims to improve the precision and recall of the regression-check 
> script.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
