damccorm opened a new issue, #20707:
URL: https://github.com/apache/beam/issues/20707

   While running a release, we have a step that has the release manager check for performance regressions:
[https://beam.apache.org/contribute/release-guide/#3-investigate-performance-regressions](https://beam.apache.org/contribute/release-guide/#3-investigate-performance-regressions)
 
   
   However, what we can actually check is a set of graphs of measurements over time. We have no clear indication of what these metrics were for the last release; we can only see vague trends in the line graph. To be clear, the line graph is excellent for spotting large sudden changes, or small changes over a long period of time, but it doesn't serve the release manager very well.
   
   For one: infrastructure might have changed in the meantime, such as compilers, test machine hardware, and load variables, along with the benchmarking code itself, which makes comparing any two points in those graphs very difficult. Worse, each point is only ever a single run, which puts it at the mercy of variance. That makes it difficult to confirm that a change is an improvement in all cases.
   
   This Jira proposes that we make it possible to reproducibly performance-test and compare two releases. In addition, we should be able to publish the results of our benchmarks, along with the comparison to the previous release, as part of the rest of the release artifacts.
   
   Obvious caveat: if there are new tests that can't run on the previous release (or old tests that can't run on the new release), they can be excluded. This can be automatic, by tagging the tests somehow, or via published explicit manual exclusions or inclusions. This implies that the tests are user-side, and rely on a given set of released SDK or Runner artifacts for execution.
   
   Ideally the release manager can run a good chunk of these tests on their local machine, or on a host in the cloud. Any such cloud resources should be identical for the before and after comparisons. Eg. if one is comparing Flink performance, then the same machine types should be used to compare Beam version X and X-1.
   
   As inspiration, a Go tool called benchstat does what I'm describing for the Go benchmark format. See the description in the documentation here: 
[https://pkg.go.dev/golang.org/x/perf/cmd/benchstat?readme=expanded#section-readme](https://pkg.go.dev/golang.org/x/perf/cmd/benchstat?readme=expanded#section-readme)
 
   
   It takes the results from one or more runs of a given benchmark (measuring time per operation, memory throughput, allocations per operation, etc.) on the old system, and the same from the new system, and produces averages and deltas, presented in a readable tabular format.
   
   eg.
   
```
$ benchstat old.txt new.txt
name        old time/op  new time/op  delta
GobEncode   13.6ms ± 1%  11.8ms ± 1%  -13.31%  (p=0.016 n=4+5)
JSONEncode  32.1ms ± 1%  31.8ms ± 1%     ~     (p=0.286 n=4+5)
```
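   The arithmetic behind such a table is straightforward, even though benchstat also runs a statistical significance test to produce the p-value. A minimal Go sketch of the mean-and-delta part (sample numbers are illustrative, and the significance test is omitted):

```go
// Sketch of the core computation benchstat performs: the per-benchmark
// mean of each sample set and the percent delta between old and new.
// benchstat additionally runs a significance test to produce the
// p-value; that part is omitted here.
package main

import "fmt"

// mean returns the average of a set of samples from repeated runs.
func mean(samples []float64) float64 {
	sum := 0.0
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples))
}

// deltaPercent returns the relative change from old to new, in percent.
// Negative means the new version is faster (for time/op metrics).
func deltaPercent(oldSamples, newSamples []float64) float64 {
	return (mean(newSamples) - mean(oldSamples)) / mean(oldSamples) * 100
}

func main() {
	// time/op in ms: four runs on the old release, five on the new one
	// (numbers are illustrative).
	oldMs := []float64{13.5, 13.6, 13.7, 13.6}
	newMs := []float64{11.8, 11.7, 11.9, 11.8, 11.8}
	fmt.Printf("GobEncode delta: %+.2f%%\n", deltaPercent(oldMs, newMs))
}
```

   Multiple runs per side are what make the delta trustworthy; a single run per release keeps us at the mercy of variance, as noted above.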
   
   This would be a valuable way to produce and present results for users, and 
to more easily validate what performance characteristics have changed between 
versions.
   
   Given the size, breadth, and distributed nature of Beam and its associated infrastructure, this is something we likely only wish to do along with the release. It will likely be time-consuming, and for larger-scale load tests on cloud resources, expensive. In order to make meaningful comparisons, as much as possible needs to be invariant between the releases under comparison.
   
   In particular: if running on a distributed set of resources (eg. a cloud cluster), the machine type and count should remain invariant (Spark and Flink clusters should be the same size; Dataflow being different is trickier, but should be unrestricted, as that's the point). Local tests on a single machine are comparable by themselves as well.
   
   The published results should include the specifics of the machine(s) the tests were run on: CPU, clock, amount of RAM, number of machines if distributed, and the official cloud designation if using cloud provider VMs (AKA machine types, like e2-standard-4, n2-highcpu-32, c6g.4xlarge, or D8d v4).
   
   The overall goal is to be able to run the comparisons on a local machine, and to be able to send jobs to clusters in the cloud. Actual provisioning of cloud resources is a non-goal of this proposal.
   
   Given a set of tests, we should be able to generate a text file with the results, for collation similar to what Go's benchstat does. Bonus points if we can have benchstat handle the task for us without modification.
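   The benchmark text format that benchstat consumes is just lines of the form `Benchmark<Name> <iterations> <value> ns/op`, with repeated lines for the same name treated as multiple runs, so emitting it from our own harness should be easy. A sketch (`BenchLine` and the benchmark name are made up for illustration):

```go
// Sketch: format a measured result as a line in the Go benchmark text
// format, so an unmodified benchstat can collate Beam results. The
// helper name and benchmark name are illustrative.
package main

import "fmt"

// BenchLine renders one run of a test. name should be CamelCase with no
// spaces (e.g. "TextIOWrite"); nsPerOp is the measured time per
// operation in nanoseconds.
func BenchLine(name string, iterations int, nsPerOp float64) string {
	return fmt.Sprintf("Benchmark%s\t%d\t%.1f ns/op", name, iterations, nsPerOp)
}

func main() {
	// Two runs of the same (hypothetical) suite; writing lines like
	// these into old.txt / new.txt makes them directly consumable by
	// `benchstat old.txt new.txt`.
	fmt.Println(BenchLine("TextIOWrite", 1000, 13600.0))
	fmt.Println(BenchLine("TextIOWrite", 1000, 13550.0))
}
```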
   
   Similar to our release validation scripts, a release manager (or any user) 
should be able to access and compare results.
   
   eg. `./release_comparison.sh ${OLD_VERSION} ${NEW_VERSION}`
   
   It must be able to support Release Candidate versions.
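   One detail such a script needs is mapping its version arguments, including release candidates, onto artifacts. A sketch of the version handling (the `-RC` suffix convention and these helper names are assumptions for illustration, not an existing Beam API):

```go
// Sketch of version handling for a comparison script that must accept
// release candidates. The "-RC" suffix convention and these helper
// names are assumptions for illustration.
package main

import (
	"fmt"
	"strings"
)

// baseVersion strips a release-candidate suffix: "2.25.0-RC1" -> "2.25.0".
func baseVersion(v string) string {
	if i := strings.Index(v, "-RC"); i >= 0 {
		return v[:i]
	}
	return v
}

// isRC reports whether v names a release candidate rather than a final
// release; RCs would resolve to staged artifacts instead of released ones.
func isRC(v string) bool {
	return strings.Contains(v, "-RC")
}

func main() {
	for _, v := range []string{"2.24.0", "2.25.0-RC1"} {
		fmt.Printf("%s: base=%s rc=%v\n", v, baseVersion(v), isRC(v))
	}
}
```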
   
   Adding this kind of infrastructure will improve trust in Beam and in Beam releases, and allow others to more consistently compare performance results.
   
   This Jira stands as a proposal and, if accepted, a place for discussion and for hanging subtasks and specifics.
   
   A useful side task would be the ability to generate these text-file versions of the benchmarks by querying the metrics database. The comparisons could then be between a few data points around one given time point and another, which would at least make the release manager's job a little easier, though that doesn't compare two releases.
   
   Imported from Jira 
[BEAM-11431](https://issues.apache.org/jira/browse/BEAM-11431). Original Jira 
may contain additional context.
   Reported by: lostluck.

