damccorm opened a new issue, #20707:
URL: https://github.com/apache/beam/issues/20707

   While running a release, we have a step that has the release manager check for performance regressions:
[https://beam.apache.org/contribute/release-guide/#3-investigate-performance-regressions](https://beam.apache.org/contribute/release-guide/#3-investigate-performance-regressions)
 
   
   However, what we can actually check is a set of graphs of measurements over time. We have no clear indication of what these metrics were for the last release; we can only see vague trends in the line graph. To be clear, the line graph is excellent for spotting large sudden changes, or small changes over a long period of time, but it doesn't serve the release manager very well.
   
   For one: infrastructure might have changed in the meantime, such as compilers, test machine hardware, and load variables, along with the benchmarking code itself, which makes comparing any two points in those graphs very difficult. Worse, each point is only ever a single run, which puts it at the mercy of variance. That makes it difficult to confirm that a change is an improvement in all cases.
   
   This Jira proposes that we make it possible to reproducibly performance-test and compare two releases. In addition, we should be able to publish the results of our benchmarks, along with the comparison to the previous release, as part of the rest of the release artifacts.
   
   Obvious caveat: if there are new tests that can't run on the previous release (or old tests that can't run on the new release), they can be excluded. This can be automatic, by tagging the tests somehow, or via published explicit manual exclusions or inclusions. This implies that the tests are user-side, and rely on a given set of released SDK or Runner artifacts for execution.
   
   Ideally the release manager can run a good chunk of these tests on their local machine, or on a host in the cloud. Any such cloud resources should be identical for the before and after comparisons. Eg. if one is comparing Flink performance, then the same machine types should be used to compare Beam version X and X-1.
   
   As inspiration, a Go tool called benchstat does what I'm describing for the Go benchmark format. See the description in the documentation here: 
[https://pkg.go.dev/golang.org/x/perf/cmd/benchstat?readme=expanded#section-readme](https://pkg.go.dev/golang.org/x/perf/cmd/benchstat?readme=expanded#section-readme)
 
   
   It takes the results from one or more runs of a given benchmark (measuring time per operation, memory throughput, allocations per operation, etc.) on the old system, and the same from the new system, and produces averages and deltas, presented in a readable tabular format.
   
   eg.
   
```
$ benchstat old.txt new.txt
name        old time/op  new time/op  delta
GobEncode   13.6ms ± 1%  11.8ms ± 1%  -13.31%  (p=0.016 n=4+5)
JSONEncode  32.1ms ± 1%  31.8ms ± 1%     ~     (p=0.286 n=4+5)
```
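   The arithmetic behind such a table is straightforward, even though benchstat also runs a statistical significance test to produce the p-value. A minimal Go sketch of the mean-and-delta part (sample numbers are illustrative, and the significance test is omitted):

```go
// Sketch of the core computation benchstat performs: the per-benchmark
// mean of each sample set and the percent delta between old and new.
// benchstat additionally runs a significance test to produce the
// p-value; that part is omitted here.
package main

import "fmt"

// mean returns the average of a set of samples from repeated runs.
func mean(samples []float64) float64 {
	sum := 0.0
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples))
}

// deltaPercent returns the relative change from old to new, in percent.
// Negative means the new version is faster (for time/op metrics).
func deltaPercent(oldSamples, newSamples []float64) float64 {
	return (mean(newSamples) - mean(oldSamples)) / mean(oldSamples) * 100
}

func main() {
	// time/op in ms: four runs on the old release, five on the new one
	// (numbers are illustrative).
	oldMs := []float64{13.5, 13.6, 13.7, 13.6}
	newMs := []float64{11.8, 11.7, 11.9, 11.8, 11.8}
	fmt.Printf("GobEncode delta: %+.2f%%\n", deltaPercent(oldMs, newMs))
}
```

   Multiple runs per side are what make the delta trustworthy; a single run per release keeps us at the mercy of variance, as noted above.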
   
   This would be a valuable way to produce and present results for users, and 
to more easily validate what performance characteristics have changed between 
versions.
   
   Given the size, breadth, and distributed nature of Beam and its associated infrastructure, this is something we likely only wish to do along with the release. It will likely be time-consuming, and for larger-scale load tests on cloud resources, expensive. In order to make meaningful comparisons, as much as possible needs to be invariant between the releases under comparison.
   
   In particular: if running on a distributed set of resources (eg. a cloud cluster), the machine type and count should remain invariant (Spark and Flink clusters should be the same size; Dataflow being different is trickier, but should be unrestricted, as that's the point). Local tests on a single machine are comparable by themselves as well.
   
   The published results should include the specifics of the machine(s) the tests were run on: CPU, clock, amount of RAM, number of machines if distributed, and the official cloud designation if using cloud provider VMs (AKA machine types, like e2-standard-4, n2-highcpu-32, c6g.4xlarge, or D8d v4).
   
   The overall goal is to be able to run the comparisons on a local machine, and to be able to send jobs to clusters in the cloud. Actual provisioning of cloud resources is a non-goal of this proposal.
   
   Given a set of tests, we should be able to generate a text file with the results, for collation similar to what Go's benchstat does. Bonus points if we can have benchstat handle the task for us without modification.
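   The benchmark text format that benchstat consumes is just lines of the form `Benchmark<Name> <iterations> <value> ns/op`, with repeated lines for the same name treated as multiple runs, so emitting it from our own harness should be easy. A sketch (`BenchLine` and the benchmark name are made up for illustration):

```go
// Sketch: format a measured result as a line in the Go benchmark text
// format, so an unmodified benchstat can collate Beam results. The
// helper name and benchmark name are illustrative.
package main

import "fmt"

// BenchLine renders one run of a test. name should be CamelCase with no
// spaces (e.g. "TextIOWrite"); nsPerOp is the measured time per
// operation in nanoseconds.
func BenchLine(name string, iterations int, nsPerOp float64) string {
	return fmt.Sprintf("Benchmark%s\t%d\t%.1f ns/op", name, iterations, nsPerOp)
}

func main() {
	// Two runs of the same (hypothetical) suite; writing lines like
	// these into old.txt / new.txt makes them directly consumable by
	// `benchstat old.txt new.txt`.
	fmt.Println(BenchLine("TextIOWrite", 1000, 13600.0))
	fmt.Println(BenchLine("TextIOWrite", 1000, 13550.0))
}
```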
   
   Similar to our release validation scripts, a release manager (or any user) 
should be able to access and compare results.
   
   eg. `./release_comparison.sh ${OLD_VERSION} ${NEW_VERSION}`
   
   It must be able to support Release Candidate versions.
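   One detail such a script needs is mapping its version arguments, including release candidates, onto artifacts. A sketch of the version handling (the `-RC` suffix convention and these helper names are assumptions for illustration, not an existing Beam API):

```go
// Sketch of version handling for a comparison script that must accept
// release candidates. The "-RC" suffix convention and these helper
// names are assumptions for illustration.
package main

import (
	"fmt"
	"strings"
)

// baseVersion strips a release-candidate suffix: "2.25.0-RC1" -> "2.25.0".
func baseVersion(v string) string {
	if i := strings.Index(v, "-RC"); i >= 0 {
		return v[:i]
	}
	return v
}

// isRC reports whether v names a release candidate rather than a final
// release; RCs would resolve to staged artifacts instead of released ones.
func isRC(v string) bool {
	return strings.Contains(v, "-RC")
}

func main() {
	for _, v := range []string{"2.24.0", "2.25.0-RC1"} {
		fmt.Printf("%s: base=%s rc=%v\n", v, baseVersion(v), isRC(v))
	}
}
```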
   
   Adding this kind of infrastructure will improve trust in Beam and in Beam releases, and allow others to more consistently compare performance results.
   
   This Jira stands as a proposal and, if accepted, a place for discussion and for hanging subtasks and specifics.
   
   A useful side task would be the ability to generate these text-file versions of the benchmarks by querying the metrics database. The comparisons could then be between a few data points around one given time point and another, which would at least make the release manager's job a little easier, though that doesn't compare two releases.
   
   Imported from Jira 
[BEAM-11431](https://issues.apache.org/jira/browse/BEAM-11431). Original Jira 
may contain additional context.
   Reported by: lostluck.

