[ 
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685367#comment-17685367
 ] 

Piotr Nowojski edited comment on FLINK-29825 at 2/7/23 4:06 PM:
----------------------------------------------------------------

[~lindong], I don't think having to wait a couple of days (for noisy benchmarks) 
to reliably detect a performance regression is an issue. We cannot run 
regression checks on each PR before it is merged, so it doesn't matter much 
whether the regression is detected 12h or 72h after merging.

Thanks for the investigation, [~Yanfei Lei]. As I said, I have a feeling we 
should be able to find a better, more sophisticated solution, but at the same 
time I cannot dive deeper into this myself. I would encourage one of you to 
take a look at the Hunter tool that I mentioned above, and maybe include it in 
the comparison. That said, if you are strongly inclined towards [~lindong]'s 
idea, I wouldn't block it, as it's indeed most likely an improvement over what 
we have right now.

{quote}
BTW, regarding the noisy benchmark mentioned above, I am curious how 
Kolmogorov-Smirnov test can address issue. Maybe I can update my proposal to 
re-use the idea. Can you help explain it?
{quote}
I've just realised that my naive idea (basically comparing the integrals of two 
EDFs) would be unable to detect a benchmark suddenly becoming very noisy while 
maintaining the same average/mean. I would need to think about it and do some 
research to clarify my thoughts. Roughly speaking, I wanted to run some 
comparison on the two empirical distribution functions. A human looking at two 
EDFs can very easily tell that they come from two different distributions:
https://i0.wp.com/statisticsbyjim.com/wp-content/uploads/2021/06/empirical_cdf_plot_multiple.png?w=576&ssl=1
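
To make the idea concrete, such a comparison could be sketched with SciPy's 
two-sample Kolmogorov-Smirnov test, which compares the whole EDFs rather than 
just the means. This is only an illustrative sketch, not anything that exists 
in flink-benchmarks today; the sample scores and the 0.05 threshold are made-up 
assumptions:
{code:python}
# Illustrative only: compare two sets of benchmark scores by their empirical
# distributions instead of their means. Requires SciPy; the scores and the
# 0.05 significance threshold below are made-up assumptions.
from scipy.stats import ks_2samp

# hypothetical JMH scores (ops/ms) for the baseline build and the new build
baseline  = [105.2, 103.8, 104.9, 106.1, 105.5, 104.2, 105.0, 104.7]
candidate = [ 98.1,  97.5,  99.0,  98.4,  97.9,  98.8,  98.2,  97.7]

# Two-sample Kolmogorov-Smirnov test: the statistic is the largest vertical
# distance between the two EDFs; the p-value says how likely such a distance
# would be if both samples came from the same distribution.
statistic, p_value = ks_2samp(baseline, candidate)
if p_value < 0.05:
    print(f"distributions differ (D={statistic:.3f}, p={p_value:.4f}) "
          f"-> possible regression or change in noise level")
else:
    print("no significant difference detected between the two distributions")
{code}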


> Improve benchmark stability
> ---------------------------
>
>                 Key: FLINK-29825
>                 URL: https://issues.apache.org/jira/browse/FLINK-29825
>             Project: Flink
>          Issue Type: Improvement
>          Components: Benchmarks
>    Affects Versions: 1.17.0
>            Reporter: Yanfei Lei
>            Assignee: Yanfei Lei
>            Priority: Minor
>
> Currently, regressions are detected by a simple script which may produce false 
> positives and false negatives, especially for benchmarks with small absolute 
> values, where small value changes cause large percentage changes. See 
> [here|https://github.com/apache/flink-benchmarks/blob/master/regression_report.py#L132-L136]
> for details.
> Also, all benchmarks are executed on a single physical machine, so it might 
> happen that hardware issues affect performance, like "[FLINK-18614] Performance 
> regression 2020.07.13".
>  
> This ticket aims to improve the precision and recall of the regression-check 
> script.
>  
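
For illustration of the small-absolute-value problem described above, here is a 
simplified sketch of a percentage-threshold check (made-up numbers and 
threshold; not the actual regression_report.py logic linked in the description):
{code:python}
# Simplified sketch of a plain percentage-threshold check (NOT the actual
# regression_report.py code); the numbers are made up for illustration.
def is_regression(baseline_score, current_score, threshold=0.10):
    """Flag a regression when the score drops by more than `threshold` (10%)."""
    return (baseline_score - current_score) / baseline_score > threshold

# a 50 ops/ms drop on a large score is only 2.5% -> not flagged
print(is_regression(2000.0, 1950.0))   # False
# a 0.4 ops/ms drop on a tiny score is 13% -> flagged, even if it is just noise
print(is_regression(3.0, 2.6))         # True
{code}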



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
