alamb opened a new issue, #4141:
URL: https://github.com/apache/arrow-rs/issues/4141

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   I want an easy way to run database benchmarks and compare performance across branches. For example, I want a single command that runs the benchmarks and produces a report telling me "does this PR make DataFusion faster or slower?". This most recently came up as part of https://github.com/apache/arrow-datafusion/pull/6034
   
   DataFusion has [several benchmark runners](https://github.com/apache/arrow-datafusion/tree/main/benchmarks), but they have grown "organically" and are hard to use: they require manually downloading datasets and are not easy to run or reproduce (see discussions on https://github.com/apache/arrow-datafusion/pull/6034#issuecomment-1521511462).
   
   Right now, the process is cumbersome -- I need to know how to create the appropriate datasets, build the runners, (potentially) convert the datasets to Parquet, run the benchmarks, and then build a report.
   
   This is made more challenging by the fact that the runners need to be built in release mode, which is slow (it takes several minutes per cycle).
   
   
   **Describe the solution you'd like**
   I want a documented methodology (ideally captured in a script) that does the following (a sketch of a possible interface follows this list):
   
   1. Setup: create / download / otherwise prepare the datafiles needed
   2. Run `<name> <optional arguments to restrict which benchmarks are run>`: run the benchmarks and write timing information into log files
   3. Compare: write out a report comparing the runs
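   
   To make this concrete, here is a minimal sketch of what such a script's interface might look like. Everything in it is hypothetical: the subcommand names, the `data/` and `results/` layout, and the runner invocation are illustrative, not an existing API:
   
   ```bash
   #!/usr/bin/env bash
   # Hypothetical interface for the proposed benchmark script.
   # Subcommands, file layout, and the runner invocation are illustrative.
   set -e
   
   case "${1:-}" in
     setup)
       # Step 1: create / download the datafiles needed
       mkdir -p data
       echo "TODO: download or generate benchmark datasets into ./data"
       ;;
     run)
       # Step 2: run the named benchmark, writing timings to a log file
       # named after the current git branch (e.g. results/main-tpch.log)
       name="${2:?benchmark name required}"
       branch="$(git rev-parse --abbrev-ref HEAD)"
       mkdir -p results
       cargo run --release --bin "$name" > "results/${branch}-${name}.log"
       ;;
     compare)
       # Step 3: write out a report comparing two runs
       echo "TODO: compare ${2:?run1 required} vs ${3:?run2 required}"
       ;;
     *)
       echo "Usage: $0 <setup|run <name>|compare <run1> <run2>>"
       exit 1
       ;;
   esac
   ```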
   
   
   Performance is a key differentiator for DataFusion. We often see third-party benchmarks comparing its performance to other systems -- e.g. YYYYY
   
   However, in addition to comparing against different systems, we also need to compare the performance of DataFusion over time. Initially, I need an easier way to compare DataFusion's performance with and without a proposed change.
   
   We currently have the tpch benchmark (links) and I have a janky script that can compare performance with the main branch: https://github.com/alamb/datafusion-benchmarking/blob/1f0beb5d32c39b6cc576e9846cddc40e692d181f/bench.sh
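   
   For the compare step in particular, here is one possible sketch. It assumes a hypothetical log format of one `query_name elapsed_ms` pair per line; the existing runners do not emit this format, so treat it purely as an illustration:
   
   ```bash
   #!/usr/bin/env bash
   # Hypothetical compare step: joins per-query timings from two runs
   # and prints the speedup. Assumes each log file contains one
   # "query_name elapsed_ms" pair per line (the format is illustrative).
   set -e
   main_log="${1:?main log required}"       # e.g. results/main-tpch.log
   branch_log="${2:?branch log required}"   # e.g. results/mybranch-tpch.log
   
   # join on query name; speedup > 1.0 means the branch is faster
   join <(sort "$main_log") <(sort "$branch_log") | awk '
     { printf "%-10s main=%8.1fms branch=%8.1fms speedup=%5.2fx\n", $1, $2, $3, $2 / $3 }
   '
   ```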
   
   However, the TPCH benchmark is a fairly small set of queries and may not cover all the interesting use cases (e.g. many of its queries involve non-trivial joins).
   
   
   
   I would like to extend the built-in benchmark runners to include:
   * A ClickBench benchmark runner in DataFusion (see the setup sketch after this list)
   * An h2oai-style benchmark runner in DataFusion: https://duckdb.org/2023/04/14/h2oai.html
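   
   For ClickBench in particular, the dataset is distributed as a single Parquet file, so the setup step could plausibly be as small as the following (the URL is the one published by the ClickBench project; treat it as an assumption and verify it before relying on it):
   
   ```bash
   # Hypothetical setup step for a ClickBench runner: fetch the
   # single-file Parquet copy of the dataset if not already present.
   mkdir -p data
   if [ ! -f data/hits.parquet ]; then
     wget -O data/hits.parquet \
       "https://datasets.clickhouse.com/hits_compatible/hits.parquet"
   fi
   ```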
   
   I propose renaming tpch to `runner` (and keeping an alias for tpch).
   
   The runner should perform the three steps described above: setup, run, and compare.
   
   
   **Describe alternatives you've considered**
   
   **Additional context**
   This will likely result in cleaning up the runners in 
https://github.com/apache/arrow-datafusion/issues/5502
   

