Re: [I] Run DataFusion benchmarks regularly and track performance history over time [arrow-datafusion]

via GitHub Sun, 15 Oct 2023 02:55:36 -0700


alamb commented on issue #5504:
URL: 
https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1763339527


   Hi @Smurphy000, that would be amazing 🙏. This is one of the issues I think 
is critical to the long-term success of DataFusion, but it has been hard to 
attract attention to it.
   
   The key, in my mind, is to minimize the complexity and infrastructure 
requirements of this solution, as DataFusion doesn't (yet) have the kind of 
resources to keep a custom system operating.
   
   # Step 1: transform benchmark data for graphing
   
   I first recommend checking out 
https://github.com/apache/arrow-datafusion/issues/6107 and seeing if you can 
write a Python script or Rust program that takes the JSON output of a benchmark 
run and emits a single line for each query run, with the relevant parameters. 
That issue has example data and the desired output. You might also have to 
extend the Rust benchmark runner.
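
As a rough illustration of that transformation, here is a minimal Python sketch that turns one benchmark report into one line per query iteration, using InfluxDB-style line protocol. The JSON layout (`queries`, `iterations`, `elapsed`, `row_count`, `start_time`), the measurement name `benchmark`, and the tag names are assumptions made for this example; the real input and desired output are the ones described in issue 6107.

```python
import json


def escape_tag(value):
    """Escape the characters that are special in line protocol tag values."""
    text = str(value)
    for ch in (",", "=", " "):
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(report, commit):
    """Return one line per query iteration.

    `report` is a dict loaded from the benchmark JSON. The keys used here
    are an assumed layout, not a confirmed schema.
    """
    lines = []
    for query in report.get("queries", []):
        # Assumed: start_time is in seconds; line protocol defaults to nanoseconds.
        timestamp_ns = int(query["start_time"]) * 1_000_000_000
        for index, iteration in enumerate(query.get("iterations", [])):
            tags = {"commit": commit, "iteration": index, "query": query["query"]}
            tag_str = ",".join(f"{k}={escape_tag(v)}" for k, v in sorted(tags.items()))
            fields = (
                f"elapsed_ms={float(iteration['elapsed'])},"
                f"row_count={int(iteration['row_count'])}i"
            )
            lines.append(f"benchmark,{tag_str} {fields} {timestamp_ns}")
    return lines


if __name__ == "__main__":
    sample = json.loads(
        '{"queries": [{"query": "Query 1", "start_time": 1678622986,'
        ' "iterations": [{"elapsed": 1555.5, "row_count": 4}]}]}'
    )
    for line in to_line_protocol(sample, "4819e7a"):
        print(line)
```

Keeping the conversion in a pure function makes it easy to test against the example data in the issue before wiring it into a runner.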
   
   In terms of implementation, I suggest starting with one setup only 
(InfluxData can supply a machine / VM initially, likely an 8-core machine with 
32 GB of RAM), and then we can expand the tested combinations as our needs grow.
   
   # Step 2: script to gather data for each commit
   
   So in my mind, the ideal solution is:
   1. A runner script that runs the benchmarks and appends the results to some 
sort of text file (ideally in the format from 
https://github.com/apache/arrow-datafusion/issues/6107) that we can check in 
and that is easy to visualize
   2. Written in one of the existing languages used in the DataFusion repo: 
Python, Bash, or Rust
   
   
   If you fancy a bit of bash scripting, you could start with 
[bench.sh](https://github.com/apache/arrow-datafusion/blob/main/benchmarks/bench.sh) 
and extend it to check out the desired SHAs (I could do this part too, if 
you were able to do https://github.com/apache/arrow-datafusion/issues/6107).
   
   
   Here is one way a testing session might look:
   ```
   # setup, fetch all needed data files
   ./benchmarks/bench.sh data 
   
   # Run the tpch benchmark on commit 4819e7a,
   # leaving results in ./benchmarks/results/4819e7a.
   # This would also add the appropriate lines to ./benchmarks/results/history.lp
   ./benchmarks/bench.sh run tpch --commit 4819e7a 
   ```
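
The session above could equally be driven from a small wrapper instead of a new `bench.sh` flag. The following is only a sketch under assumptions: `build_commands`, `append_history`, `run_commit`, and the `collect` callback are hypothetical names invented for this example, and the wrapper assumes `bench.sh run tpch` is invoked from the repository root after checking out the commit with `git`.

```python
import subprocess
from pathlib import Path


def build_commands(commit, benchmark="tpch"):
    """Return the commands for one benchmark run at `commit` (not executed here)."""
    return [
        ["git", "checkout", "--detach", commit],
        ["./benchmarks/bench.sh", "run", benchmark],
    ]


def append_history(history_path, lines):
    """Append result lines to the history file, creating it if needed."""
    path = Path(history_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line.rstrip("\n") + "\n")


def run_commit(commit, history_path="benchmarks/results/history.lp", collect=None):
    """Check out `commit`, run the benchmark, then append the collected lines.

    `collect` is a hypothetical callback that turns the run's output into
    history lines; without it, nothing is appended.
    """
    for command in build_commands(commit):
        subprocess.run(command, check=True)  # stop on the first failure
    lines = collect(commit) if collect else []
    append_history(history_path, lines)


if __name__ == "__main__":
    # Show the commands without running them.
    for command in build_commands("4819e7a"):
        print(" ".join(command))
```

Appending rather than rewriting keeps `history.lp` friendly to version control, since each run only adds lines at the end of the file.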
   
   Then we could write up instructions on how to visualize the data in 
history.lp (I would probably use InfluxDB/Grafana, as that is what I know).
   
   Does that make sense?
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

