alamb opened a new issue, #8860: URL: https://github.com/apache/arrow-datafusion/issues/8860
### Is your feature request related to a problem or challenge?

The [ClickBench](https://benchmark.clickhouse.com/) benchmark has excellent coverage for aggregate / grouping queries.

We have used the ClickBench benchmark, run via `bench.sh`, for important work improving aggregates such as https://github.com/apache/arrow-datafusion/issues/6988 and https://github.com/apache/arrow-datafusion/issues/7064.

However, there are some important optimizations like https://github.com/apache/arrow-datafusion/pull/8849 and https://github.com/apache/arrow-datafusion/issues/7191 from @avantgardnerio where the ClickBench benchmark does not cover the existing use case.

For example, @jayzhan211's change in https://github.com/apache/arrow-datafusion/pull/8849#issuecomment-1890482901 makes certain realistic queries

<details><summary>Details on `bench.sh`</summary>
<p>

```shell
$ ./benchmarks/bench.sh --help
Orchestrates running benchmarks against DataFusion checkouts

Usage:
./benchmarks/bench.sh data [benchmark]
./benchmarks/bench.sh run [benchmark]
./benchmarks/bench.sh compare <branch1> <branch2>

**********
Examples:
**********
# Create the datasets for all benchmarks in /Users/andrewlamb/Software/arrow-datafusion/benchmarks/data
./bench.sh data

# Run the 'tpch' benchmark on the datafusion checkout in /source/arrow-datafusion
DATAFASION_DIR=/source/arrow-datafusion ./bench.sh run tpch

**********
* Commands
**********
data:         Generates data needed for benchmarking
run:          Runs the named benchmark
compare:      Comares results from benchmark runs

**********
* Benchmarks
**********
all(default):           Data/Run/Compare for all benchmarks
tpch:                   TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), single parquet file per table
tpch_mem:               TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), query from memory
tpch10:                 TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), single parquet file per table
tpch10_mem:             TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), query from memory
parquet:                Benchmark of parquet reader's filtering speed
sort:                   Benchmark of sorting speed
clickbench_1:           ClickBench queries against a single parquet file
clickbench_partitioned: ClickBench queries against a partitioned (100 files) parquet

**********
* Supported Configuration (Environment Variables)
**********
DATA_DIR        directory to store datasets
CARGO_COMMAND   command that runs the benchmark binary
DATAFASION_DIR  directory to use (default /Users/andrewlamb/Software/arrow-datafusion/benchmarks/..)
```

</p>
</details>

### Describe the solution you'd like

I would like to add a new benchmark to `bench.sh` that uses the same dataset but has different queries than the existing ClickBench benchmarks:

```shell
$ ./benchmarks/bench.sh run clickbench_extended
```

The new queries should:

1. Be realistic (one can write an English sentence explaining the quantity they compute and how it might be used)
2. Reflect some query pattern

Here is an example from https://github.com/apache/arrow-datafusion/pull/8849#issuecomment-1890482901

## Query: Distinct counts

Query Explanation: Data exploration: understand the qualities of the data in `hits.parquet`

Query Properties: multiple count distinct aggregates on string datatypes

```sql
SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT "MobilePhone"), COUNT(DISTINCT "MobilePhoneModel")
FROM 'hits.parquet';
```

### Describe alternatives you've considered

_No response_

### Additional context

_No response_

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
