alamb commented on issue #5504: URL: https://github.com/apache/arrow-datafusion/issues/5504#issuecomment-1763339527
Hi @Smurphy000, that would be amazing 🙏. This is one of the issues I think is critical to the long-term success of DataFusion, but it has been hard to attract attention to it. The key, in my mind, is to minimize the complexity and infrastructure requirements of the solution, as DataFusion doesn't (yet) have the kind of resources needed to keep a custom system operating.

# Step 1: transform benchmark data for graphing

I first recommend checking out https://github.com/apache/arrow-datafusion/issues/6107 and seeing if you can write a Python script / Rust program that takes the JSON output of a benchmark run and emits a single line for each query run with the relevant parameters. That issue has example data and the desired output. You might also have to extend the Rust benchmark runner.

In terms of implementation, I suggest starting with one setup only (InfluxData can supply a machine / VM initially, likely an 8-core machine with 32 GB of RAM), and then we can expand the tested combinations as our needs grow.

# Step 2: script to gather data for each commit

So in my mind, the ideal solution looks like:

1. A runner script that runs the benchmarks and appends the results to some sort of text file (ideally the format from https://github.com/apache/arrow-datafusion/issues/6107) that we can check in and that is easy to visualize
2. Written in one of the existing languages used in the DataFusion repo: Python, Bash, or Rust

If you fancy a bit of Bash scripting, you could start with [bench.sh](https://github.com/apache/arrow-datafusion/blob/main/benchmarks/bench.sh) and extend it to check out the desired SHAs (I could do this part too, if you were able to do https://github.com/apache/arrow-datafusion/issues/6107).

Here is one way a testing session might look:

```
# setup, fetch all needed data files
./benchmarks/bench.sh data

# Run tpch benchmark on commit 4819e7a,
# leave results in ./benchmarks/results/4819e7a
# Would add appropriate lines to ./benchmarks/results/history.lp
./benchmarks/bench.sh run tpch --commit 4819e7a
```

Then we could write up instructions on how to visualize the data in history.lp (I would probably use influxdb/grafana as that is what I know).

Does that make sense?
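
To make Step 1 a little more concrete, here is a rough Python sketch of the JSON-to-one-line-per-query-run transformation. It is only an illustration under several assumptions: that `history.lp` would hold InfluxDB line protocol, that the benchmark JSON has a `context.start_time` in seconds plus a `queries` list whose entries carry `iterations` with `elapsed` (milliseconds) and `row_count`, and that the measurement, tag, and field names (`benchmark`, `name`, `commit`, `query`, `iteration`, `elapsed_ms`) are acceptable. All of these are hypothetical; the example data in #6107 is the real reference. `escape_tag`, `to_line_protocol`, and `SAMPLE_RUN` are made-up names.

```python
def escape_tag(value):
    """Escape the characters that are special in line protocol tag values."""
    return str(value).replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def to_line_protocol(run, benchmark, commit):
    """Return one line protocol line per (query, iteration) in `run`.

    `run` is the parsed benchmark JSON (layout assumed, see the text above).
    """
    # Assumption: start_time is seconds since the Unix epoch.
    # Line protocol timestamps default to nanoseconds.
    timestamp_ns = int(run["context"]["start_time"]) * 1_000_000_000
    lines = []
    for query in run["queries"]:
        for i, it in enumerate(query["iterations"]):
            tags = {
                "name": benchmark,
                "commit": commit,
                "query": query["query"],
                "iteration": i,
            }
            tag_str = ",".join(f"{k}={escape_tag(v)}" for k, v in tags.items())
            # Integer fields carry an `i` suffix in line protocol.
            field_str = (
                f"elapsed_ms={float(it['elapsed'])},"
                f"row_count={int(it['row_count'])}i"
            )
            lines.append(f"benchmark,{tag_str} {field_str} {timestamp_ns}")
    return lines


# Made-up sample data in the assumed layout (not real benchmark results).
SAMPLE_RUN = {
    "context": {"start_time": 1697000000},
    "queries": [
        {
            "query": "Query 1",
            "iterations": [
                {"elapsed": 1555.03, "row_count": 4},
                {"elapsed": 1533.5, "row_count": 4},
            ],
        }
    ],
}

for line in to_line_protocol(SAMPLE_RUN, benchmark="tpch", commit="4819e7a"):
    print(line)
```

A real version would read the JSON file written by the benchmark run (for example with `json.load`) and append the resulting lines to `history.lp` rather than printing them.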

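
For Step 2, the per-commit flow could be sketched as follows. This is a dry-run outline in Python, not a proposal for the final implementation: `bench.sh` has no `--commit` flag today, so the sketch simply checks out each SHA and calls the existing `./benchmarks/bench.sh run tpch`. The function names `plan_commands` and `run_plan`, and the choice of `main` as the ref to return to afterwards, are assumptions.

```python
import subprocess


def plan_commands(commits, benchmark="tpch", restore_ref="main"):
    """Build the list of commands needed to benchmark each commit in order."""
    plan = []
    for sha in commits:
        plan.append(["git", "checkout", sha])
        plan.append(["./benchmarks/bench.sh", "run", benchmark])
    # Return to the starting ref once all commits have been benchmarked.
    plan.append(["git", "checkout", restore_ref])
    return plan


def run_plan(plan, repo_dir="."):
    """Execute the planned commands, stopping at the first failure."""
    for argv in plan:
        subprocess.run(argv, cwd=repo_dir, check=True)


# Dry run: print the commands instead of executing them.
for argv in plan_commands(["4819e7a"]):
    print(" ".join(argv))
```

One thing to keep in mind with this approach is that checking out an older commit also checks out that commit's version of `bench.sh`, so the runner itself may need to live outside the tree being benchmarked.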