[
https://issues.apache.org/jira/browse/FLINK-29825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685367#comment-17685367
]
Piotr Nowojski edited comment on FLINK-29825 at 2/7/23 4:05 PM:
----------------------------------------------------------------
[~lindong], I don't think having to wait a couple of days (for noisy
benchmarks) to reliably detect a performance regression is an issue. We
cannot run regression checks on every PR before it is merged, so it really
doesn't matter much whether the regression is detected 12h or 72h after
merging.
Thanks for the investigation [~Yanfei Lei]. As I said, I have a feeling we
should be able to find a better, more sophisticated solution, but at the same
time I cannot dive deeper into this myself. I would encourage one of you to
take a look at the Hunter tool that I mentioned above, and maybe include it in
the comparison. That said, if you are strongly inclined towards [~lindong]'s
idea, I wouldn't block it, as it's indeed most likely an improvement over what
we have right now.
{quote}
Can you help explain it?
{quote}
I've just realised that my naive idea (basically comparing the integrals of
two EDFs) would be unable to detect that a benchmark suddenly became very
noisy. I would need to think about it and do some research to clarify my
thoughts. Roughly speaking, I wanted to run some comparison on two empirical
distribution functions (EDFs). A human looking at two EDFs can very easily
tell that they come from two different distributions:
https://i0.wp.com/statisticsbyjim.com/wp-content/uploads/2021/06/empirical_cdf_plot_multiple.png?w=576&ssl=1
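One standard way to compare two EDFs programmatically is a two-sample
Kolmogorov-Smirnov test, whose statistic is the maximum vertical distance
between the two curves. A minimal sketch of that direction, assuming per-run
benchmark scores are available as plain lists of floats (the function name and
significance level below are hypothetical):
{code:python}
from scipy.stats import ks_2samp

def distributions_differ(baseline_scores, current_scores, alpha=0.05):
    """Return True if the two samples are unlikely to come from the
    same distribution, per a two-sample Kolmogorov-Smirnov test."""
    statistic, p_value = ks_2samp(baseline_scores, current_scores)
    return p_value < alpha
{code}
Unlike comparing integrals, a distribution test like this would also react
when a benchmark merely becomes noisier, since increased variance changes the
shape of the EDF.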
> Improve benchmark stability
> ---------------------------
>
> Key: FLINK-29825
> URL: https://issues.apache.org/jira/browse/FLINK-29825
> Project: Flink
> Issue Type: Improvement
> Components: Benchmarks
> Affects Versions: 1.17.0
> Reporter: Yanfei Lei
> Assignee: Yanfei Lei
> Priority: Minor
>
> Currently, regressions are detected by a simple script which may produce
> false positives and false negatives, especially for benchmarks with small
> absolute values, where small value changes cause large percentage changes
> (illustrated in the sketch after this description). See
> [here|https://github.com/apache/flink-benchmarks/blob/master/regression_report.py#L132-L136]
> for details.
> Moreover, all benchmarks are executed on a single physical machine, so
> hardware issues can affect performance, as in "[FLINK-18614] Performance
> regression 2020.07.13".
>
> This ticket aims to improve the precision and recall of the regression-check
> script.
>
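A simplified illustration of the percentage-check fragility described in the
ticket (a sketch only, not the actual regression_report.py logic; the function
name, threshold, and numbers are hypothetical):
{code:python}
def is_regression(baseline_score, current_score, threshold=0.05):
    """Flag a regression when the relative drop exceeds the threshold."""
    return (baseline_score - current_score) / baseline_score > threshold

# The same 0.2 ops/ms fluctuation is a 10% drop on a small-valued benchmark
# but only 0.01% on a large-valued one:
print(is_regression(2.0, 1.8))        # True  -- noise flagged as a regression
print(is_regression(2000.0, 1999.8))  # False
{code}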
--
This message was sent by Atlassian Jira
(v8.20.10#820010)