Omega359 commented on issue #21165: URL: https://github.com/apache/datafusion/issues/21165#issuecomment-4139727285
> Can write SQL benchmarks
>
> Currently writing benchmarks requires using dataframe APIs and quite a bit of ceremony (recent example: https://github.com/apache/datafusion/pull/21180). I would like it to be possible to write SQL benchmarks, including with some SQL or non-SQL setup (could be a bash script to download data), even if a bit of rust is required (e.g. sql_bench!("../q1/")).
>
> This has several advantages:
>
> - Less code / boilerplate.
> - Benchmarks are more in line with real world usage.
> - Can tweak benchmarks without recompiling.
>
> Proposal
>
> Macros/harness code to easily point at a directory of SQL files and generate test cases with setup, etc.

I know @adriangb has a [draft poc](https://github.com/apache/datafusion/pull/20911) for porting dfbench to criterion; however, I got nerd sniped by this and decided to have a go at something similar. The idea is essentially that the benchmarks run by bench.sh/dfbench will mostly be converted from being code based to being almost completely SQL based.

Thus, I'm in the process of doing a PoC for converting [DuckDB's SQL benchmark suite](https://github.com/duckdb/duckdb/tree/main/benchmark) to work with DataFusion (rewritten in Rust). I've got it working for a few benchmark suites (imdb, taxi) and I'm working on adding the various ClickBench benchmarks now. It's criterion based, though adding/switching to Divan shouldn't be much of an issue. The license will be the same as DuckDB's, since the suite is largely converted from it using LLMs (MIT license, compatible with Apache).

I think I'm about 3 weeks away from having something that I feel would be ready for a draft PR. It's likely that I'll use bench.sh/dfbench as a basis for making it easy to run a benchmark (with on-demand download of the data, etc.) and datafusion-cli for any conversion/transformation of data (CSV -> Parquet, for example). Currently it's 100% SQL based except for the downloading of data.
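As a rough sketch of what "pointing at a directory of SQL files" could look like (the layout and file names below are purely illustrative, not the actual structure of the PoC), a benchmark suite might be laid out as:

```
benchmarks/imdb/
├── init.sql      -- create external tables, load data
├── assert.sql    -- sanity checks before measuring
├── q1.sql        -- the query measured by criterion
├── verify.sql    -- optional result validation
└── cleanup.sql   -- drop tables, remove temp data
```

A harness would then only need the suite directory name to discover and register each query as a criterion benchmark case.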
`BENCH_NAME=imdb cargo bench --bench sql`

ConfigOptions can be set in SQL either during init (so they're valid across the whole benchmark suite), per benchmark, or on demand via environment variables through `SessionConfig::from_env()`.

The lifecycle of a benchmark is currently:

- `initialize` (load + init)
- `assert` (verify conditions via SQL if you want)
- `run` (the part that is executed/benchmarked by criterion)
- `verify` (optional result validation)
- `cleanup` (drop data, etc.)

I'm looking at adding auto-creation of the expected result data, similar to how the sqllogictests behave, since doing it manually is not entirely feasible.
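A minimal, self-contained sketch of how a harness might map a single SQL file onto these lifecycle phases. The `-- phase:` comment-marker convention and the `split_phases` helper are my own illustration of the idea, not the PoC's actual file format:

```rust
use std::collections::HashMap;

/// Split a benchmark SQL file into lifecycle phases.
/// Hypothetical convention: a line like `-- phase: run` starts a new
/// phase; all following non-empty lines belong to it until the next
/// marker. Lines before any marker fall into `initialize`.
fn split_phases(sql: &str) -> HashMap<String, Vec<String>> {
    let mut phases: HashMap<String, Vec<String>> = HashMap::new();
    let mut current = String::from("initialize"); // default phase
    for line in sql.lines() {
        if let Some(name) = line.trim().strip_prefix("-- phase:") {
            current = name.trim().to_string();
        } else if !line.trim().is_empty() {
            phases.entry(current.clone()).or_default().push(line.to_string());
        }
    }
    phases
}

fn main() {
    let sql = "\
-- phase: initialize
CREATE EXTERNAL TABLE taxi STORED AS PARQUET LOCATION 'data/taxi.parquet';
-- phase: run
SELECT passenger_count, COUNT(*) FROM taxi GROUP BY passenger_count;
-- phase: cleanup
DROP TABLE taxi;
";
    let phases = split_phases(sql);
    // Only the `run` phase would be handed to criterion for measurement;
    // the others execute once around it.
    assert_eq!(phases["run"].len(), 1);
    assert!(phases["cleanup"][0].starts_with("DROP TABLE"));
}
```

The point of the separation is that only the `run` statements land inside the criterion measurement loop, while `initialize`/`assert` run once before and `verify`/`cleanup` once after.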
