[
https://issues.apache.org/jira/browse/BEAM-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Burke updated BEAM-11431:
--------------------------------
Description:
While running the release, we have a step that has us check for Performance
Regressions for our releases.
[https://beam.apache.org/contribute/release-guide/#3-investigate-performance-regressions]
However, what we're able to check are the measured graphs over time. We have
no clear indication of what these metrics were for the last release; we can
only see vague trends in the line graph. To be clear, the line graph is
excellent for spotting large sudden changes, or small changes over a long
period of time, but it doesn't serve the release manager very well.
For one, infrastructure might have changed in the meantime (compilers, test
machine hardware, load variables, and the benchmarking code itself), which
makes comparing any two points in those graphs very difficult. Worse, the
points are only ever single runs, which puts them at the mercy of variance.
Identifying changes that are unambiguously good in all cases is difficult.
This Jira proposes that we should make it possible to reproducibly performance
test and compare two releases. In addition, we should be able to publish the
results of our benchmarks along with the rest of the release artifacts, along
with the comparison to the previous release.
Obvious caveat: if there are new tests that can't run on the previous release
(or old tests that can't run on the new release), they may be excluded. This
can be automated by tagging the tests, or by publishing explicit manual
exclusions or inclusions. This implies that the tests are user-side, and rely
on a given set of released SDK or Runner artifacts for execution.
Ideally the release manager can run a good chunk of these tests on their local
machine, or on a host in the cloud. Any such cloud resources should be identical
for before-and-after comparisons. E.g. if one is comparing Flink performance,
then the same machine types should be used for both Beam version X and X+1.
As inspiration, a Go tool called benchstat does what I'm describing for the
Go benchmark format. See the description in the documentation here:
[https://pkg.go.dev/golang.org/x/perf/cmd/benchstat?readme=expanded#section-readme]
It takes the results from one or more runs of a given benchmark (measuring time
per operation, memory throughput, allocations per operation, etc.) on the old
system, and the same from the new system, and produces averages and deltas.
These are presented in a compact tabular format, e.g.:
{{$ benchstat old.txt new.txt}}
{{name        old time/op  new time/op  delta}}
{{GobEncode   13.6ms ± 1%  11.8ms ± 1%  -13.31%  (p=0.016 n=4+5)}}
{{JSONEncode  32.1ms ± 1%  31.8ms ± 1%  ~        (p=0.286 n=4+5)}}
This would be a valuable way to produce and present results for users, and to
more easily validate what performance characteristics have changed between
versions.
Given the size, breadth, and distributed nature of Beam and its associated
infrastructure, this is something we likely only wish to do along with the
release. It will likely be time-consuming, and for larger-scale load tests on
cloud resources, expensive. In order to make meaningful comparisons, as much
as possible needs to be invariant between the releases under comparison.
In particular, if running on a distributed set of resources (e.g. a cloud
cluster), the machine type and count should remain invariant (Spark and Flink
clusters should be the same size; Dataflow being different is trickier but
should be unrestricted, as that's the point). Local tests on a single machine
are comparable by themselves as well.
The published results should include the specifics of the machine(s) being run
on: CPU, clock speed, RAM amount, number of machines if distributed, and the
official cloud designation if using cloud provider VMs (i.e. machine types
like e2-standard-4, n2-highcpu-32, c6g.4xlarge, or D8d v4).
The overall goal is to be able to run the comparisons on a local machine, and
to be able to send jobs to clusters in the cloud. Actual provisioning of cloud
resources is a non-goal of this proposal.
Given a set of tests, we should be able to generate a text file with the
results, for collation similar to what Go's benchstat does. Bonus points if we
can have benchstat handle the task for us without modification.
Similar to our release validation scripts, a release manager (or any user)
should be able to access and compare results.
e.g. {{./release_comparison.sh $OLD_VERSION $NEW_VERSION}}
It must be able to support Release Candidate versions.
Adding this kind of infrastructure will improve trust in Beam and Beam
releases, and allow others to more consistently compare performance results.
This Jira stands as a proposal and, if accepted, a place for discussion and
for hanging subtasks and specifics.
A side task that could be useful would be the ability to generate these
text-file versions of the benchmarks by querying the metrics database. The
comparisons could then be between a few datapoints from around one point in
time and another, which would at least make the release manager's job a little
easier, though it doesn't compare two releases.
> Automated Release Performance Benchmark Regression/Improvement comparison
> -------------------------------------------------------------------------
>
> Key: BEAM-11431
> URL: https://issues.apache.org/jira/browse/BEAM-11431
> Project: Beam
> Issue Type: Improvement
> Components: testing
> Reporter: Robert Burke
> Priority: P1
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)