Hello,

archery is the "shim" scripts that glue some of the steps (2-4) that you
described. It builds arrow (c++ for now), find the multiple benchmark
binaries, runs them, and collects the outputs. I encourage you to check the
implementation, notably [1] and [2] (and generally [3]).

Think of it as merging steps 2-4 into a single script without the CI's
orchestration (steps in the pipeline), thus making it CI-agnostic and
reproducible locally.
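
Concretely, reproducing steps 2-4 locally boils down to installing the
package and invoking a single subcommand. This is only a sketch assuming the
layout of [3]; the exact subcommand names may still change:

    # archery lives under dev/archery in the Arrow repository
    $ pip install -e dev/archery
    # build Arrow C++, run the benchmark binaries, and collect their JSON output
    $ archery benchmark run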

To give you some context, here are some "user stories" we'd like to achieve
(ARROW-5070):

1. Performance data should be tracked and stored in a database for each commit
   in the master branch. (ARROW-5071)

2. A reviewer should be able to trigger an on-demand regression check in a PR
   (ARROW-5071 and some ursabot stuff). Feedback (regression or not) should
   be given either via a PR status, an automated comment, or a bot-user
   declined review. (ARROW-5071)

3. A developer should be able to compare (diff) builds locally. By build, I
   mean a cmake build directory; e.g. it can be a toolchain change or
   different compiler flags. (ARROW-4827)

4. A developer should be able to compare commits locally. (ARROW-4827)

The current iteration of archery does 3 and 4 (via `archery benchmark diff`),
and is easily modifiable to do 1 and 2 (via `archery benchmark`, minus the
infrastructure setup). What you're proposing only targets 1, maybe 2, but
definitely not 3 and 4.
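
For 3 and 4, the local workflow looks roughly like this (a sketch; the exact
arguments and defaults may differ from what lands in [3]):

    # compare two commits (user story 4)
    $ archery benchmark diff <contender-commit> <baseline-commit>

    # compare two cmake build directories (user story 3)
    $ archery benchmark diff <contender-build-dir> <baseline-build-dir>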

Here are various reasons why I think the archery route is preferable to a mix
of scattered scripts, CI pipeline steps, and ad hoc Go binaries.

1. It is OS-agnostic since it's written in Python, and it only depends on
   cmake and git being installed in the PATH.

2. It is self-contained in Arrow's repository; there is no need to manually
   install external dependencies (a Go toolchain, then compiling and
   installing benchstat or benchcmp). This assumes python3 and pip are
   available, which we already need for pyarrow.

3. It is written as a library where the command line is only a frontend (see
   the sketch after this list). This makes it very easy to test and re-use. It
   also opens the door to clearing technical debt we've accumulated in `dev/`.
   This is not relevant to the benchmark sub-project, but it is still relevant
   to Arrow developers in general.

4. It is benchmark-framework agnostic. It does not depend on Google
   Benchmark's or Go's benchmark output formats; it supports them but does not
   mandate them. This will be key to supporting Python (ASV) and other
   languages.

5. Shell scripts tend to grow unmaintainable. I say this as someone who
   abuses them (the archery implementation is derived from a local bash
   script).

6. It is not orchestrated by a complex CI pipeline (which is effectively a
   non-portable, hardly reproducible script). It is self-contained and can run
   within a CI or on a local machine. This is very convenient for local
   testing and debugging; I loathe waiting for the CI, especially when
   iterating in development.
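
To illustrate points 3 and 4, here is a minimal, self-contained sketch of the
idea behind [1] and [2] (illustrative code only, not archery's actual API):
one small adapter per benchmark framework feeds a framework-agnostic
comparison, and the same functions are usable from the command line, from
tests, or from other tooling.

    import json

    def load_google_benchmark(path):
        # Parse Google Benchmark's --benchmark_format=json output into a plain
        # {name: real_time} mapping. This is the role [1] plays; other
        # frameworks (Go, ASV, ...) would get their own adapter.
        with open(path) as f:
            report = json.load(f)
        return {b["name"]: b["real_time"] for b in report["benchmarks"]}

    def diff(contender, baseline, threshold=0.05):
        # Framework-agnostic comparison: lower is better, so flag anything
        # slower than `threshold` as a regression.
        for name, new in sorted(contender.items()):
            old = baseline.get(name)
            if old is None:
                continue
            change = (new - old) / old
            flag = "  (regression)" if change > threshold else ""
            print(f"{name}: {change:+.1%}{flag}")

    if __name__ == "__main__":
        diff(load_google_benchmark("contender.json"),
             load_google_benchmark("baseline.json"))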

You can get a sneak peek of the automation working here:
http://nashville.ursalabs.org:4100/#/builders/16/builds/129
(note that this doesn't use dedicated hardware yet).

François

[1] https://github.com/apache/arrow/blob/2a953f1808566da01bbb90faeabe8131ff55f902/dev/archery/archery/benchmark/google.py
[2] https://github.com/apache/arrow/blob/2a953f1808566da01bbb90faeabe8131ff55f902/dev/archery/archery/benchmark/runner.py
[3] https://github.com/apache/arrow/pull/4141/files

On Wed, Apr 24, 2019 at 9:24 PM Melik-Adamyan, Areg
<areg.melik-adam...@intel.com> wrote:
>
> Wes,
>
> The process, as I think of it, should be the following.
> 1. A commit triggers a build in TeamCity. I have set up TeamCity, but we can
> use whatever CI we would like.
> 2. TeamCity uses a pool of identical machines to run the predefined (or all)
> performance benchmarks on one of the build machines from the pool.
> 3. Each benchmark generates output; by using Google Benchmark we generate a
> JSON-format file.
> 4. The build step in TeamCity which runs the performance benchmarks gathers
> all those files and parses them.
> 5. For each parsed output it creates an entry in the DB with the commit ID as
> the key, plus auxiliary information that can be helpful.
> 6. Codespeed, sitting on top of that database, visualizes the data in the
> dashboard, marking regressions red and progressions green compared to either
> a baseline that you define or the previous commit, as all the commits are
> ordered in time.
> 7. You can create custom queries to compare specific commits or see trends on 
> the timeline.
>
> I am not mandating codespeed or anything else, but we should start with 
> something. We can use something more sophisticated, like Influx.
>
> > In benchmarking, one of the hardest parts (IMHO) is the process/workflow
> > automation. I'm in support of the development of a "meta-benchmarking"
> > framework that offers automation, extensibility, and possibility for
> > customization.
> [>] Meta is good, and I totally support it, but while we are working on that
> there is a need for something very simple but usable.
> >
> > One of the reasons that people don't do more benchmarking as part of their
> > development process is that the tooling around it isn't great.
> > Using a command line tool [1] that outputs unconfigurable text to the
> > terminal to compare benchmarks seems inadequate to me.
> [>] I would argue here: it is the minimal configuration that works with
> external tooling without creating a huge infrastructure around it. We already
> use the Google Benchmark library, which provides all the needed output
> formats. And if you do not like CodeSpeed we can use anything else, e.g. Dana
> (https://github.com/google/dana) from Google.
> >
> > In the cited example
> >
> > $ benchcmp old.txt new.txt
> >
> > Where do old.txt and new.txt come from? I would like to have that detail 
> > (build
> > of appropriate component, execution of benchmarks and collection of results)
> > automated.
> [>] In the case of Go it is: $ go test -run=^$ -bench=. ./... > old.txt
> Then you switch to the new branch and do the same with > new.txt, then you
> run benchcmp and it does the comparison. Three bash commands.
>
> >
> > FWIW, 7 and a half years ago [2] I wrote a small project called vbench to 
> > assist
> > with benchmark automation, so this has been a long-term interest of mine.
> > Codespeed existed in 2011, here is what I wrote about it in December 2011,
> > and it is definitely odd to find myself typing almost the exact same words 
> > years
> > later:
> >
> > "Before starting to write a new project I looked briefly at codespeed... The
> > dealbreaker is that codespeed is just a web application. It doesn't 
> > actually (to
> > my knowledge, someone correct me if I'm wrong?) have any kind of a
> > framework for orchestrating the running of benchmarks throughout your code
> > history."
> [>] I totally agree with you. But the good part is that it doesn't need to
> have orchestration. TeamCity or any other CI will do those steps for you.
> And the fact that you can run the benchmarks by hand, and the CI can just
> replicate your actions, makes it suitable for most cases. And I don't care
> about codespeed or asv; as you said, it is just a stupid web app. The most
> important part is to create a working pipeline. While we are looking for the
> best salt-cellar, we can use the plastic one. :)
> >
> > asv [3] is a more modern and evolved version of vbench. But it's Python-
> > specific. I think we need the same kind of thing except being able to 
> > automate
> > the execution of any benchmarks for any component in the Arrow project. So
> > we have some work to do.
> [>] Here is the catch: trying to do this for any benchmark will consume time
> and resources, and still there will be something left behind. It is hard to
> cover the general case and assume that a particular one, like C++, will be
> covered.
>
> >
> > - Wes
> >
> > [1]:
> > https://github.com/golang/tools/blob/master/cmd/benchcmp/benchcmp.go
> > [2]: http://wesmckinney.com/blog/introducing-vbench-new-code-performance-analysis-and-monitoring-tool/
> > [3]: https://github.com/airspeed-velocity/asv
> >
> > On Wed, Apr 24, 2019 at 11:18 AM Sebastien Binet <bi...@cern.ch> wrote:
> > >
> > > On Wed, Apr 24, 2019 at 11:22 AM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > >
> > > >
> > > > Hi Areg,
> > > >
> > > > On 23/04/2019 at 23:43, Melik-Adamyan, Areg wrote:
> > > > > Because we are using Google Benchmark, which has a specific format,
> > > > > there is a tool called benchcmp which compares two runs:
> > > > >
> > > > > $ benchcmp old.txt new.txt
> > > > > benchmark           old ns/op     new ns/op     delta
> > > > > BenchmarkConcat     523           68.6          -86.88%
> > > > >
> > > > > So the comparison part is done and there is no need to create
> > > > > infra for that.
> > > >
> > >
> > > "surprisingly" Go is already using that benchmark format :) and (on
> > > top of a Go-based benchcmp command) there is also a benchstat command
> > > that, given a set of multiple before/after data points adds some
> > > amount of statistical analysis:
> > >  https://godoc.org/golang.org/x/perf/cmd/benchstat
> > >
> > > using the "benchmark" file format of benchcmp and benchstat would
> > > allow better cross-language interop.
> > >
> > > cheers,
> > > -s
