alamb opened a new issue, #4141: URL: https://github.com/apache/arrow-rs/issues/4141
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

I want an easy way to run and compare the performance of branches on various database benchmarks. For example, I want a single command to run the benchmarks and get a report that tells me "does this PR make DataFusion faster or slower?". This most recently came up as part of https://github.com/apache/arrow-datafusion/pull/6034.

DataFusion has [several benchmark runners](https://github.com/apache/arrow-datafusion/tree/main/benchmarks), but they have grown "organically": they are hard to use, require manually downloading datasets, and are not easy to run or reproduce (see the discussion on https://github.com/apache/arrow-datafusion/pull/6034#issuecomment-1521511462).

Right now it is cumbersome to do so -- I need to know how to create the appropriate datasets, build the runners, convert the datasets to parquet (potentially), run the benchmarks, and then build a report. This is made more challenging by the fact that the runners need to be built in release mode, which is slow (several minutes per cycle).

**Describe the solution you'd like**

I want a documented methodology (ideally in a script) that will do:

1. **Setup**: creates / downloads / whatever the data files needed
2. **Run** `<name> <optional arguments to restrict what benchmarks are run>`: runs the benchmarks and writes timing information into log files
3. **Compare**: writes out a report comparing the runs

(A sketch of one possible interface for such a script is at the end of this issue.)

Performance is a key differentiator for DataFusion. We often see third-party benchmarks comparing its performance to other systems (e.g. YYYYY). However, in addition to comparing against different systems, we also need to compare the performance of DataFusion over time. Initially, I need an easier way to compare DataFusion's performance with and without a proposed change.

We currently have the tpch benchmark (links) and I have a janky script that can compare performance with the main branch: https://github.com/alamb/datafusion-benchmarking/blob/1f0beb5d32c39b6cc576e9846cddc40e692d181f/bench.sh. However, TPC-H is a fairly small set of queries and may not cover all the interesting use cases (e.g. many of its queries have non-trivial joins).

I would like to extend the built-in benchmark runners to include:

* a ClickBench benchmark runner in DataFusion
* an h2oai-style benchmark runner in DataFusion: https://duckdb.org/2023/04/14/h2oai.html

I propose renaming `tpch` to `runner` (and keeping an alias for `tpch`). The runner should do:

**Describe alternatives you've considered**

**Additional context**

This will likely result in cleaning up the runners in https://github.com/apache/arrow-datafusion/issues/5502
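
As a concrete sketch of the setup / run / compare workflow above -- the script name, subcommands, and result paths here are hypothetical assumptions for illustration, not an existing CLI:

```bash
#!/usr/bin/env bash
# Hypothetical usage of the proposed benchmark script; the script name,
# subcommands, and result paths are illustrative assumptions only.
set -e

# 1. Setup: create / download the datasets the benchmarks need (done once)
./bench.sh data tpch
./bench.sh data clickbench

# 2. Run: execute a named benchmark on each branch of interest,
#    writing timing information into per-branch log files
git checkout main
./bench.sh run tpch                # e.g. writes results/main/tpch.json
git checkout my-feature-branch
./bench.sh run tpch                # e.g. writes results/my-feature-branch/tpch.json

# 3. Compare: produce a report saying whether the branch got faster or slower
./bench.sh compare main my-feature-branch
```

Splitting the steps into separate subcommands would mean dataset generation only has to happen once, and the slow release-mode build once per branch, while `run` and `compare` can be repeated cheaply.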

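A minimal sketch of what the compare step could do, assuming (hypothetically) that each run writes a `query,seconds` CSV per benchmark -- the real runners may emit a different format such as JSON:

```bash
#!/usr/bin/env bash
# Minimal compare sketch: join per-query timings from two runs and print the
# speedup. The file layout and CSV format are assumptions for illustration.
main_results="results/main/tpch.csv"                 # lines like "q1,1.234"
branch_results="results/my-feature-branch/tpch.csv"

echo "query  main(s)  branch(s)  speedup"
join -t, <(sort "$main_results") <(sort "$branch_results") |
  awk -F, '{ printf "%-6s %7.3f %9.3f %8.2fx\n", $1, $2, $3, $2 / $3 }'
```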