[
https://issues.apache.org/jira/browse/BEAM-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Burke updated BEAM-11431:
--------------------------------
Description:
While running the release, we have a step that has us check for Performance
Regressions for our releases.
[https://beam.apache.org/contribute/release-guide/#3-investigate-performance-regressions]
However, what we're able to check are the measured graphs over time. We have
no clear indication of what these metrics were for the last release; we can
only see vague trends in the line graph. To be clear, the line graph is
excellent for spotting large sudden changes, or small changes over a long
period of time, but it doesn't serve the release manager very well.
For one, infrastructure might have changed in the meantime (compilers, test
machine hardware, load variables, and the benchmarking code itself), which
makes comparing any two points in those graphs very difficult. Worse, the
points are only ever single runs, which puts them at the mercy of variance.
Identifying changes that are unambiguously good in all cases is difficult.
This Jira proposes that we should make it possible to reproducibly performance
test and compare two releases. In addition, we should be able to publish the
results of our benchmarks along with the rest of the release artifacts, along
with the comparison to the previous release.
Obvious caveat: if there are new tests that can't run on the previous release
(or old tests that can't run on the new release), they may be excluded. This
can be automated by tagging the tests, or by publishing explicit manual
exclusions or inclusions. This implies that the tests are user-side, and rely
on a given set of released SDK or Runner artifacts for execution.
Ideally the release manager can run a good chunk of these tests on their local
machine, or on a host in the cloud. Any such cloud resources should be identical
for before-and-after comparisons. E.g. if one is comparing Flink performance,
then the same machine types should be used for both Beam version X and X+1.
As inspiration, a Go tool called benchstat does what I'm describing for the
Go benchmark format. See the description in the documentation here:
[https://pkg.go.dev/golang.org/x/perf/cmd/benchstat?readme=expanded#section-readme]
It takes the results from one or more runs of a given benchmark (measuring time
per operation, memory throughput, allocations per operation, etc.) on the old
system, and the same from the new system, and produces averages and deltas.
These are presented in a compact tabular format, e.g.:
{{$ benchstat old.txt new.txt}}
{{name        old time/op  new time/op  delta}}
{{GobEncode   13.6ms ± 1%  11.8ms ± 1%  -13.31%  (p=0.016 n=4+5)}}
{{JSONEncode  32.1ms ± 1%  31.8ms ± 1%  ~        (p=0.286 n=4+5)}}
This would be a valuable way to produce and present results for users, and to
more easily validate what performance characteristics have changed between
versions.
Given the size, breadth, and distributed nature of Beam and its associated
infrastructure, this is something we likely only wish to do along with the
release. It will likely be time-consuming, and for larger-scale load tests on
cloud resources, expensive. In order to make meaningful comparisons, as much
as possible needs to be invariant between the releases under comparison.
In particular, if running on a distributed set of resources (e.g. a cloud
cluster), the machine type and count should remain invariant (Spark and Flink
clusters should be the same size; Dataflow being different is trickier but
should be unrestricted, as that's the point). Local tests on a single machine
are comparable by themselves as well.
The published results should include the specifics of the machine(s) being run
on: CPU, clock speed, RAM amount, number of machines if distributed, and the
official cloud designation if using cloud provider VMs (i.e. machine types
like e2-standard-4, n2-highcpu-32, c6g.4xlarge, or D8d v4).
The overall goal is to be able to run the comparisons on a local machine, and
to be able to send jobs to clusters in the cloud. Actual provisioning of cloud
resources is a non-goal of this proposal.
Given a set of tests, we should be able to generate a text file with the
results, for collation similar to what Go's benchstat does. Bonus points if we
can have benchstat handle the task for us without modification.
Similar to our release validation scripts, a release manager (or any user)
should be able to access and compare results.
e.g. {{./release_comparison.sh $OLD_VERSION $NEW_VERSION}}
It must be able to support Release Candidate versions.
Adding this kind of infrastructure will improve trust in Beam and Beam
releases, and allow others to more consistently compare performance results.
This Jira stands as a proposal and, if accepted, a place for discussion and
for hanging subtasks and specifics.
A side task that could be useful would be the ability to generate these
text-file versions of the benchmarks by querying the metrics database. The
comparisons could then be between a few datapoints from around one point in
time and another, which would at least make the release manager's job a little
easier, though it doesn't compare two releases.
> Automated Release Performance Benchmark Regression/Improvement comparison
> -------------------------------------------------------------------------
>
> Key: BEAM-11431
> URL: https://issues.apache.org/jira/browse/BEAM-11431
> Project: Beam
> Issue Type: Improvement
> Components: testing
> Reporter: Robert Burke
> Priority: P1
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)