Omega359 commented on issue #21165:
URL: https://github.com/apache/datafusion/issues/21165#issuecomment-4139727285

   > Can write SQL benchmarks
   > 
   > Currently writing benchmarks requires using dataframe APIs and quite a bit 
of ceremony (recent example: https://github.com/apache/datafusion/pull/21180).
   > I would like it to be possible to write SQL benchmarks, including with 
some SQL or non-SQL setup (could be a bash script to download data), even if a 
bit of rust is required (e.g. sql_bench!("../q1/")).
   > 
   > This has several advantages:
   > 
   >     Less code / boilerplate.
   >     Benchmarks are more in line with real world usage.
   >     Can tweak benchmarks without recompiling.
   > 
   > Proposal
   > 
   > Macros/harness code to easily point at a directory of SQL files and 
generate test cases with setup, etc.
   
   I know @adriangb has a [draft PoC](https://github.com/apache/datafusion/pull/20911) for porting dfbench to criterion; however, I got nerd-sniped by this and decided to have a go at something similar. The idea is essentially that the benchmarks run by bench.sh/dfbench will mostly be converted from being code-based to being almost completely SQL-based.
   
   Thus, I'm in the process of doing a PoC that converts [DuckDB's SQL benchmark suite](https://github.com/duckdb/duckdb/tree/main/benchmark) to work for DataFusion (rewritten in Rust). I've got it working for a few benchmark suites (imdb, taxi) and I'm working on adding the various clickbench benchmarks now. It's criterion-based, though adding or switching to Divan shouldn't be much of an issue. The license will be the same as DuckDB's (MIT, compatible with Apache) since it's largely converted from that suite using LLMs. I think I'm about 3 weeks away from having something that I feel would be ready for a draft PR.
   
   It's likely that I'll use bench.sh/dfbench as a basis for making it easy to run a benchmark (with on-demand download of the data, etc.) and datafusion-cli for any conversion/transformation of data (CSV -> Parquet, for example). Currently it's 100% SQL-based except for downloading the data.
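   As a sketch of what such a conversion step might look like when run through datafusion-cli (the table name, paths, and exact CSV options below are illustrative; option spellings vary between DataFusion versions):

   ```sql
   -- Hypothetical setup script: register the raw CSV, then rewrite it as
   -- Parquet for the actual benchmark runs. Names/paths are made up.
   CREATE EXTERNAL TABLE trips_csv
   STORED AS CSV
   LOCATION 'data/trips.csv'
   OPTIONS ('format.has_header' 'true');

   COPY (SELECT * FROM trips_csv)
   TO 'data/trips.parquet'
   STORED AS PARQUET;
   ```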
   
   `BENCH_NAME=imdb cargo bench --bench sql`
   
   ConfigOptions can be set in SQL, either during init (so they apply across the whole benchmark suite) or per benchmark, or on demand via environment variables using `SessionConfig::from_env()`.
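   For example, suite-wide settings could be applied with plain SET statements in the init SQL (the option names below are real DataFusion ConfigOptions; the values and the idea that they live in an init file are illustrative):

   ```sql
   -- Applied once during initialize, valid for the whole suite.
   SET datafusion.execution.target_partitions = 4;
   SET datafusion.execution.batch_size = 8192;
   ```

   `SessionConfig::from_env()` picks the same options up from `DATAFUSION_`-prefixed environment variables (e.g. `DATAFUSION_EXECUTION_BATCH_SIZE=8192`), which is handy for one-off tweaks without touching the SQL files.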
   
   The lifecycle of a benchmark is currently:
   
   - `initialize` (load + init)
   - `assert` (verify preconditions via SQL, if desired)
   - `run` (the part that criterion executes and measures)
   - `verify` (optional result validation)
   - `cleanup` (drop data, etc.)
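   To make that concrete, a single benchmark directory might hold one SQL file per stage. The layout and file names below are my guess at such a convention, not the actual PoC's:

   ```sql
   -- benchmarks/imdb/q1/ (hypothetical layout)
   --   initialize.sql   creates external tables, SETs config
   --   assert.sql       precondition checks
   --   run.sql          the statement criterion measures
   --   verify.sql       optional result validation
   --   cleanup.sql      drops tables, removes temp data
   --
   -- run.sql would then contain just the benchmarked query, e.g.:
   SELECT count(*) FROM title WHERE production_year > 2000;
   ```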
    
   I'm looking at adding automatic creation of the expected result data, similar to how the sqllogictests behave, since creating it manually is not really feasible.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

