It doesn't support https. I had to add an exception to the HTTPS Everywhere extension for 
"metrics.beam.apache.org".

*facepalm* Thanks Udi! It would always hang on me because I use HTTPS Everywhere.

To be explicit, I support the idea of reviewing the release guide, but not
changing the release process for the release that is already in progress.

I consider the release guide immutable during a release. Thus, a change to the release guide can only affect upcoming releases, not one that is already in progress.

+1, and I think we can also evaluate whether flaky tests should be treated
as release blockers or not. Some flaky tests could be hiding real issues
our users might face.

Flaky tests are also worth taking into account when releasing, but they are a little harder to find because they may just happen to pass while building the release. It is possible, though, if we strictly capture flaky tests in JIRA and mark them with the Fix Version for the release.
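
Something like the following could then surface them during the release. This is only an untested sketch; the "flake" label and the "2.24.0" Fix Version are assumptions for illustration, not an established convention:

    # Sketch: list unresolved flaky-test issues tagged for a release via the
    # JIRA REST API. The "flake" label and the version "2.24.0" are
    # hypothetical, used purely for illustration.
    import requests

    JIRA_SEARCH = "https://issues.apache.org/jira/rest/api/2/search"

    def flaky_release_blockers(fix_version):
        jql = (f'project = BEAM AND labels = flake '
               f'AND fixVersion = "{fix_version}" AND resolution = Unresolved')
        resp = requests.get(JIRA_SEARCH, params={"jql": jql, "fields": "summary"})
        resp.raise_for_status()
        return [(issue["key"], issue["fields"]["summary"])
                for issue in resp.json()["issues"]]

    for key, summary in flaky_release_blockers("2.24.0"):
        print(key, summary)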

We keep accumulating dashboards and
tests that few people care about, so it is probably worth either using
them or setting up a way to alert us of regressions during the release
cycle, to catch issues even before the RCs.

+1 The release guide should be explicit about which performance test results to evaluate.

The prerequisite is that we have all the stats in one place. They seem to be scattered across http://metrics.beam.apache.org and https://apache-beam-testing.appspot.com.

Would it be possible to consolidate the two, i.e. use the Grafana-based dashboard to load the legacy stats?
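
If the legacy numbers are still queryable in BigQuery, a one-off backfill into the new InfluxDB backend might be enough. A rough sketch of the idea; the dataset, table, and database names below are invented for illustration:

    # Sketch: backfill legacy BigQuery stats into InfluxDB so the Grafana
    # dashboards can chart them. Table and database names are hypothetical.
    from google.cloud import bigquery
    from influxdb import InfluxDBClient

    bq = bigquery.Client(project="apache-beam-testing")
    influx = InfluxDBClient(host="localhost", port=8086,
                            database="beam_test_metrics")

    rows = bq.query(
        "SELECT timestamp, metric, value "
        "FROM `apache-beam-testing.legacy.metrics`").result()

    influx.write_points([{
        "measurement": row.metric,
        "time": row.timestamp.isoformat(),
        "fields": {"value": float(row.value)},
    } for row in rows])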

For the evaluation during the release process, I suggest using a standardized set of performance tests for all runners, e.g. (a rough comparison sketch follows the list):

- Nexmark
- ParDo (Classic/Portable)
- GroupByKey
- IO
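
The per-release check could then be as simple as comparing each suite's headline metric against the previous release. All numbers below are made up; real values would come from the metrics backend:

    # Sketch: flag suites whose headline metric regressed beyond a tolerance
    # compared to the previous release. The numbers are invented placeholders;
    # real values would be queried from the metrics backend.
    BASELINE = {"nexmark": 120.0, "pardo": 45.0, "groupbykey": 60.0, "io": 300.0}
    CANDIDATE = {"nexmark": 124.0, "pardo": 58.0, "groupbykey": 61.0, "io": 310.0}
    TOLERANCE = 0.10  # flag anything more than 10% slower

    def regressions(baseline, candidate, tolerance=TOLERANCE):
        for suite, old in baseline.items():
            new = candidate[suite]
            if new > old * (1 + tolerance):
                yield suite, old, new

    for suite, old, new in regressions(BASELINE, CANDIDATE):
        print(f"{suite}: {old:.0f}s -> {new:.0f}s exceeds {TOLERANCE:.0%} tolerance")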


-Max

On 21.07.20 01:23, Ahmet Altay wrote:

On Mon, Jul 20, 2020 at 3:07 PM Ismaël Mejía <[email protected]> wrote:

    +1

    This is not in the release guide, and we should probably re-evaluate
    whether this should be a release-blocking reason.
    Of course, exceptionally, a performance regression could be motivated by
    a correctness fix or a worthwhile refactor, so we should consider this.


+1, and I think we can also evaluate whether flaky tests should be treated as release blockers or not. Some flaky tests could be hiding real issues our users might face.

To be explicit, I support the idea of reviewing the release guide, but not changing the release process for the release that is already in progress.


    We have been tracking and fixing performance regressions multiple
    times, found simply by checking the Nexmark tests, including on the
    ongoing 2.23.0 release, so the value is there. Nexmark does not yet
    cover Python and portable runners, so we are probably still missing
    many issues, and it is worth working on this. In any case, we should
    probably decide which validations matter. We keep accumulating
    dashboards and tests that few people care about, so it is probably
    worth either using them or setting up a way to alert us of regressions
    during the release cycle, to catch issues even before the RCs.


I agree. And if we cannot use the dashboards/tests in a meaningful way, IMO we can remove them. There is not much value in maintaining them if they do not provide important signals.


    On Fri, Jul 10, 2020 at 9:30 PM Udi Meiri <[email protected]> wrote:
     >
     > On Thu, Jul 9, 2020 at 12:48 PM Maximilian Michels <[email protected]> wrote:
     >>
     >> Not yet, I just learned about the migration to a new frontend,
     >> including a new backend (InfluxDB instead of BigQuery).
     >>
     >> >  - Are the metrics available on metrics.beam.apache.org?
     >>
     >> Is http://metrics.beam.apache.org online? I was never able to
     >> access it.
     >
     >
     > It doesn't support https. I had to add an exception to the HTTPS
     > Everywhere extension for "metrics.beam.apache.org".
     >
     >>
     >>
     >> >  - What is the feature delta between using
     >> > metrics.beam.apache.org (much better UI) and using
     >> > apache-beam-testing.appspot.com?
     >>
     >> AFAIK it is an ongoing migration and the delta appears to be high.
     >>
     >> >  - Can we notice regressions faster than release cadence?
     >>
     >> Absolutely! A report with the latest numbers, including statistics
     >> about the growth of metrics, would be useful.
     >>
     >> >  - Can we get automated alerts?
     >>
     >> I think we could set up a Jenkins job to do this.
     >>
     >> -Max
     >>
     >> On 09.07.20 20:26, Kenneth Knowles wrote:
     >> > Questions:
     >> >
     >> >   - Are the metrics available on metrics.beam.apache.org?
     >> >   - What is the feature delta between using
     >> > metrics.beam.apache.org (much better UI) and using
     >> > apache-beam-testing.appspot.com?
     >> >   - Can we notice regressions faster than release cadence?
     >> >   - Can we get automated alerts?
     >> >
     >> > Kenn
     >> >
     >> > On Thu, Jul 9, 2020 at 10:21 AM Maximilian Michels
     >> > <[email protected]> wrote:
     >> >
     >> >     Hi,
     >> >
     >> >     We recently saw an increase in latency migrating from Beam
     >> >     2.18.0 to 2.21.0 (Python SDK with Flink Runner). This proved
     >> >     very hard to debug, and it looks like each version between the
     >> >     two led to increased latency.
     >> >
     >> >     This is not the first time we saw issues when migrating;
     >> >     another time we had a decline in checkpointing performance and
     >> >     thus added a checkpointing test [1] and dashboard [2] (see the
     >> >     checkpointing widget).
     >> >
     >> >     That makes me wonder if we should monitor performance
     >> >     (throughput / latency) for basic use cases as part of the
     >> >     release testing. Currently, our release guide [3] mentions
     >> >     running examples but not evaluating the performance. I think it
     >> >     would be good practice to check relevant charts with performance
     >> >     measurements as part of the release process. The release guide
     >> >     should reflect that.
     >> >
     >> >     WDYT?
     >> >
     >> >     -Max
     >> >
     >> >     PS: Of course, this requires tests and metrics to be available.
     >> >     This PR adds latency measurements to the load tests [4].
     >> >
     >> >
     >> >     [1] https://github.com/apache/beam/pull/11558
     >> >     [2] https://apache-beam-testing.appspot.com/explore?dashboard=5751884853805056
     >> >     [3] https://beam.apache.org/contribute/release-guide/
     >> >     [4] https://github.com/apache/beam/pull/12065
     >> >
