[jira] [Comment Edited] (FLINK-29825) Improve benchmark stability

Dong Lin (Jira) Sat, 04 Feb 2023 07:58:05 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684183#comment-17684183
 ]


Dong Lin edited comment on FLINK-29825 at 2/4/23 3:57 PM:
----------------------------------------------------------

According to the wiki, the two-sample Kolmogorov-Smirnov test is used to determine 
whether two distributions (i.e. collections of values) are close enough. On the 
other hand, regression detection is more about determining whether a single 
value (e.g. the latest performance) is observably worse than the best 
performance in the past. These are two quite different problems.
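
To make the difference concrete, here is a minimal sketch that contrasts the two questions. The throughput numbers are made up for illustration (not real Flink benchmark data), and the helper names `ks_statistic` and `drop_from_best` are hypothetical. The KS statistic is computed by hand as the largest gap between the two empirical CDFs, so no external library is assumed.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    # The gap can only change at observed values, so checking those is enough.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def drop_from_best(history, latest):
    """Relative drop of the latest value from the best (highest) past value."""
    best = max(history)
    return (best - latest) / best


# Hypothetical throughput numbers (higher is better), not real Flink data.
history = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0]
recent = history[1:] + [80.0]  # the same window shifted by one slow run

ks = ks_statistic(history, recent)
drop = drop_from_best(history, 80.0)
print(f"KS statistic: {ks:.3f}, drop from best: {drop:.1%}")
```

With these made-up numbers the two windows differ by a single slow run, so the KS statistic stays small, while the latest value is clearly far below the best past value. This only illustrates why the two problems differ; it says nothing about which approach works better on real benchmark data.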

Are there any success stories of using Kolmogorov-Smirnov to detect regressions 
in practice?

I drafted this 
[doc|https://docs.google.com/document/d/1Bvzvq79Ll5yxd1UtC0YzczgFbZPAgPcN3cI0MjVkIag]
 to explain the algorithm that I would like to try for detecting Flink 
regressions. It is not exactly the same as the one I used before for TensorFlow 
(because I lost that doc), but the ideas are pretty much the same. Using the 
heuristics described in this doc, I am confident it should have a much lower 
false-positive rate than the relatively simple formula used in the existing 
[script|https://github.com/apache/flink-benchmarks/blob/master/regression_report.py].

The parameters (e.g. threshold for regression detection) of this algorithm need 
to be tuned based on the benchmark data.

Hopefully I can get time to implement and evaluate this algorithm in the coming 
2 weeks. The main issue that I don't know how to address yet is how to update 
the script to get the maximum and deviation of throughput across multiple runs 
for a given commit point.
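
As a rough sketch of the aggregation step only: assuming the benchmark results can be obtained as `(commit, throughput)` pairs, the maximum and deviation per commit could be computed as below. The input shape, the function names, the 5% default threshold, and the comparison rule in `is_regression` are all assumptions made for illustration. This is not the algorithm from the doc and not how `regression_report.py` currently works.

```python
from collections import defaultdict
from statistics import pstdev


def summarize_by_commit(runs):
    """Group (commit, throughput) pairs and return {commit: (max, stdev)}."""
    grouped = defaultdict(list)
    for commit, throughput in runs:
        grouped[commit].append(throughput)
    # pstdev returns 0.0 for a single run, so commits with one run still work.
    return {c: (max(v), pstdev(v)) for c, v in grouped.items()}


def is_regression(best_past_max, latest_max, latest_dev, threshold=0.05):
    """Hypothetical rule: flag only if the latest max, plus one deviation of
    slack, is still more than `threshold` below the best past max."""
    return latest_max + latest_dev < best_past_max * (1 - threshold)


# Made-up runs for two commits (higher throughput is better).
runs = [
    ("c1", 100.0), ("c1", 104.0), ("c1", 102.0),
    ("c2", 90.0), ("c2", 92.0), ("c2", 91.0),
]
summary = summarize_by_commit(runs)
best_max, _ = summary["c1"]
latest_max, latest_dev = summary["c2"]
print(is_regression(best_max, latest_max, latest_dev))
```

The threshold here is exactly the kind of parameter that would need to be tuned against real benchmark data.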



> Improve benchmark stability
> ---------------------------
>
>                 Key: FLINK-29825
>                 URL: https://issues.apache.org/jira/browse/FLINK-29825
>             Project: Flink
>          Issue Type: Improvement
>          Components: Benchmarks
>    Affects Versions: 1.17.0
>            Reporter: Yanfei Lei
>            Assignee: Yanfei Lei
>            Priority: Minor
>
> Currently, regressions are detected by a simple script which may produce false 
> positives and false negatives, especially for benchmarks with small absolute 
> values, where small value changes cause large percentage changes. See 
> [here|https://github.com/apache/flink-benchmarks/blob/master/regression_report.py#L132-L136]
>  for details.
> In addition, all benchmarks are executed on one physical machine, so hardware 
> issues may affect performance, as in "[FLINK-18614] Performance 
> regression 2020.07.13".
>  
> This ticket aims to improve the precision and recall of the regression-check 
> script.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
