Re: [ANNOUNCE] Performance Daily Monitoring Moved from Ververica to Apache Flink Slack Channel

Yanfei Lei Tue, 29 Nov 2022 05:13:01 -0800

Hi Martijn,

Thanks for bringing this up.


Over the past two months, this channel has helped us find many benchmark
failures, such as FLINK-29883 [1], FLINK-29886 [2], FLINK-30015 [3] and
FLINK-30181 [4]. I have also investigated several of the frequently reported
regressions and replied under the notifications in the Slack channel (copied
here):

   1. serializerHeavyString
   <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=2&revs=200>:
   It has been unstable for a long time; see FLINK-27165 [5] for possible
   reasons.
   2. Regressions are detected by a simple script which may produce false
   positives and false negatives. This is especially true for benchmarks
   with small absolute values, where a small change in value causes a
   large percentage change; see [6] for details.

   slidingWindow
   <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=slidingWindow&extr=on&quarts=on&equid=off&env=2&revs=200> (value ~= 600),
   stateBackends.ROCKS
   <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=stateBackends.ROCKS&extr=on&quarts=on&equid=off&env=2&revs=200> (value ~= 260) and
   serializerHeavyString
   <http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=2&revs=200> (value ~= 170)
   may therefore not be true regressions.

   3. For deployAllTasks.STREAMING
   <http://codespeed.dak8s.net:8000/timeline/#/?exe=8&ben=deployAllTasks.STREAMING&extr=on&quarts=on&equid=off&env=2&revs=200>,
   the benchmark result is the time it takes to deploy a job, so a lower
   value means better performance; see [7] for details. FLINK-27571 [8]
   would fix this problem.
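
To make the points above concrete, here is a minimal Python sketch of a
percentage-based regression check. It is only an illustration and is not the
logic of regression_report.py [6]: the function names, the 5% threshold, the
lower_is_better flag and all sample scores are hypothetical assumptions.

```python
from statistics import median


def relative_change(baseline, recent):
    """Relative change of the recent median versus the baseline median."""
    base = median(baseline)
    return (median(recent) - base) / base


def is_regression(baseline, recent, threshold=0.05, lower_is_better=False):
    """Flag a regression when the score moves in the 'bad' direction by more
    than `threshold` (hypothetical default of 5%)."""
    change = relative_change(baseline, recent)
    # Throughput-style scores: a drop is bad. Time-style scores: a rise is bad.
    return change > threshold if lower_is_better else change < -threshold


# Hypothetical scores: the same absolute drop of 10 at two different scales.
small = is_regression([170, 172, 168], [160, 158, 161])  # -10/170, about -5.9%
large = is_regression([600, 602, 598], [590, 588, 591])  # -10/600, about -1.7%

# Hypothetical time-based benchmark (lower is better) that became faster.
naive = is_regression([100, 102, 98], [90, 92, 88])
aware = is_regression([100, 102, 98], [90, 92, 88], lower_is_better=True)

print(small, large, naive, aware)  # True False True False
```

With these made-up numbers, the same absolute change is flagged only for the
benchmark with the smaller score, and a check that assumes "higher is better"
reports an improvement in a time-based benchmark as a regression.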


As mentioned above, regressions are detected by a simple script that is not
very stable; FLINK-29825 [9] was created to improve the benchmark's
stability. I had planned to invite more volunteers to monitor the channel
once regression detection became more stable, but I have been tied up with
other work lately. Sorry for the late response. Any suggestions on handling
benchmark regressions and failures are welcome.

[1] https://issues.apache.org/jira/browse/FLINK-29883

[2] https://issues.apache.org/jira/browse/FLINK-29886

[3] https://issues.apache.org/jira/browse/FLINK-30015

[4] https://issues.apache.org/jira/browse/FLINK-30181

[5] https://issues.apache.org/jira/browse/FLINK-27165

[6]
https://github.com/apache/flink-benchmarks/blob/master/regression_report.py#L132-L136

[7]
https://github.com/apache/flink-benchmarks/blob/master/src/main/java/org/apache/flink/scheduler/benchmark/deploying/DeployingTasksInStreamingJobBenchmarkExecutor.java#L58

[8] https://issues.apache.org/jira/browse/FLINK-27571

[9] https://issues.apache.org/jira/browse/FLINK-29825


Best,

Yanfei

On Tue, Nov 29, 2022 at 3:54 PM Martijn Visser <[email protected]> wrote:

> Hi,
>
> Is there any update to be expected on the benchmark? I see results of the
> benchmark being posted to Slack, but it appears that it's not being
> monitored and no follow-up actions are being taken. I think it's currently
> lacking a process on how to interpret the results and what action should
> be taken and by whom.
>
> Best regards,
>
> Martijn
>
> On Thu, Nov 3, 2022 at 12:22 PM Jing Ge <[email protected]> wrote:
>
> > Thanks yanfei for driving this!
> >
> > Looking forward to further discussion w.r.t. the workflow.
> >
> > Best regards,
> > Jing
> >
> > On Mon, Oct 31, 2022 at 6:04 PM Mason Chen <[email protected]> wrote:
> >
> > > +1, thanks for driving this!
> > >
> > > On a side note, can we also ensure that a performance summary report
> > > for Flink major version upgrades is in release notes, once this
> > > infrastructure becomes mature? From the user perspective, it would be
> > > nice to know what the expected (or unexpected) regressions in a major
> > > version upgrade are. I've seen the community do something like this
> > > before (e.g. the major rocksdb version bump in 1.14?) and it was quite
> > > valuable to know that upfront!
> > >
> > > Best,
> > > Mason
> > >
> > > On Fri, Oct 28, 2022 at 1:46 AM weijie guo <[email protected]> wrote:
> > >
> > > > Thanks Yanfei for driving this.
> > > >
> > > > It allows us to find performance regressions easily. I have recently
> > > > made some improvements to the scheduling-related parts, so your work
> > > > is very important for ensuring that these changes do not cause
> > > > unexpected problems.
> > > >
> > > > Best regards,
> > > >
> > > > Weijie
> > > >
> > > >
> > > > On Fri, Oct 28, 2022 at 4:03 PM Congxian Qiu <[email protected]> wrote:
> > > >
> > > > > Thanks for driving this and making the performance monitoring public.
> > > > > This will help us notice and resolve performance problems quickly.
> > > > >
> > > > > Looking forward to the workflow and detailed descriptions of
> > > > > flink-dev-benchmarks.
> > > > >
> > > > > Best,
> > > > > Congxian
> > > > >
> > > > >
> > > > > On Thu, Oct 27, 2022 at 12:41 PM Yun Tang <[email protected]> wrote:
> > > > >
> > > > > > Thanks, Yanfei, for driving this to monitor the performance in the
> > > > > > Apache Flink Slack Channel.
> > > > > >
> > > > > > Look forward to the workflow and detailed descriptions of
> > > > > > flink-dev-benchmarks.
> > > > > >
> > > > > > Best
> > > > > > Yun Tang
> > > > > > ________________________________
> > > > > > From: Hangxiang Yu <[email protected]>
> > > > > > Sent: Thursday, October 27, 2022 10:59
> > > > > > To: [email protected] <[email protected]>
> > > > > > Subject: Re: [ANNOUNCE] Performance Daily Monitoring Moved from
> > > > > > Ververica to Apache Flink Slack Channel
> > > > > >
> > > > > > Hi, Yanfei.
> > > > > > Thanks for driving this.
> > > > > > It could help us to detect and resolve the regression problem quickly
> > > > > > and officially.
> > > > > > I'd like to join as a maintainer.
> > > > > > Looking forward to the workflow.
> > > > > >
> > > > > > On Wed, Oct 26, 2022 at 5:18 PM Yuan Mei <[email protected]> wrote:
> > > > > >
> > > > > > > Thanks, Yanfei, for driving this and making the performance monitoring
> > > > > > > publicly available.
> > > > > > >
> > > > > > > Looking forward to seeing the workflow, and more details as Martijn
> > > > > > > mentioned.
> > > > > > >
> > > > > > > Best
> > > > > > > Yuan
> > > > > > >
> > > > > > > On Wed, Oct 26, 2022 at 2:59 PM Martijn Visser <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi Yanfei Lei,
> > > > > > > >
> > > > > > > > Thanks for setting this up! It would be interesting to also know
> > > > > > > > which aspects of Flink are monitored for "performance". I'm
> > > > > > > > assuming there are specific pieces of functionality that are
> > > > > > > > performance tested, but it would be great if this would be written
> > > > > > > > down somewhere (next to a procedure how to detect a regression and
> > > > > > > > what should be next steps).
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > >
> > > > > > > > Martijn
> > > > > > > >
> > > > > > > > On Wed, Oct 26, 2022 at 8:21 AM Zakelly Lan <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi yanfei,
> > > > > > > > >
> > > > > > > > > Thanks for driving this! It's a great help.
> > > > > > > > >
> > > > > > > > > I would like to join as a maintainer.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Zakelly
> > > > > > > > >
> > > > > > > > > On Wed, Oct 26, 2022 at 11:32 AM yanfei lei <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi everyone,
> > > > > > > > > >
> > > > > > > > > > As discussed earlier, we planned to create a benchmark channel in
> > > > > > > > > > the Apache Flink Slack [1], but the plan was shelved for a while [2].
> > > > > > > > > > I have picked this work up and created the #flink-dev-benchmarks
> > > > > > > > > > channel for performance regression notifications.
> > > > > > > > > >
> > > > > > > > > > We have a regression report script [3] that runs daily, and a
> > > > > > > > > > notification is sent to the Slack channel when the last few
> > > > > > > > > > benchmark results are significantly worse than the baseline.
> > > > > > > > > > Note that regressions are detected by a simple script which may
> > > > > > > > > > produce false positives and false negatives. All benchmarks are
> > > > > > > > > > executed on one physical machine [4], which is provided by
> > > > > > > > > > Ververica (Alibaba) [5], so hardware issues may affect
> > > > > > > > > > performance, as in "[FLINK-18614] Performance regression
> > > > > > > > > > 2020.07.13" [6].
> > > > > > > > > >
> > > > > > > > > > After the migration, we need a procedure to watch over the
> > > > > > > > > > performance of the entire Flink code base together. For example,
> > > > > > > > > > if a regression occurs, the cause needs to be investigated and
> > > > > > > > > > the problem resolved. In the past, this procedure was maintained
> > > > > > > > > > internally within Ververica, but we think making it public would
> > > > > > > > > > benefit everyone. I volunteer to serve as one of the initial
> > > > > > > > > > maintainers, and would be glad if more contributors joined me.
> > > > > > > > > > I will also prepare some guidelines to help others get familiar
> > > > > > > > > > with the workflow, and will start a new thread to discuss the
> > > > > > > > > > workflow soon.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > [1] https://www.mail-archive.com/[email protected]/msg58666.html
> > > > > > > > > > [2] https://issues.apache.org/jira/browse/FLINK-28468
> > > > > > > > > > [3] https://github.com/apache/flink-benchmarks/blob/master/regression_report.py
> > > > > > > > > > [4] http://codespeed.dak8s.net:8080
> > > > > > > > > > [5] https://lists.apache.org/thread/jzljp4233799vwwqnr0vc9wgqs0xj1ro
> > > > > > > > > > [6] https://issues.apache.org/jira/browse/FLINK-18614
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best,
> > > > > > Hangxiang.
> > > > > >
> > > > >
> > > >
> > >
> >
>
