[I] Add 'clickbench_extended' benchmark [arrow-datafusion]

via GitHub Sun, 14 Jan 2024 04:02:24 -0800


alamb opened a new issue, #8860:
URL: https://github.com/apache/arrow-datafusion/issues/8860


   ### Is your feature request related to a problem or challenge?
   
   The [ClickBench](https://benchmark.clickhouse.com/) benchmark has excellent 
coverage for aggregate / grouping queries.
   
   We have used the ClickBench benchmark, run via `bench.sh`, for important 
work improving aggregates such as 
https://github.com/apache/arrow-datafusion/issues/6988 and 
https://github.com/apache/arrow-datafusion/issues/7064. However, there are some 
important optimizations, like 
https://github.com/apache/arrow-datafusion/pull/8849 and 
https://github.com/apache/arrow-datafusion/issues/7191 from @avantgardnerio, 
whose use cases the ClickBench benchmark does not cover.
   
   For example, @jayzhan211's change in 
https://github.com/apache/arrow-datafusion/pull/8849#issuecomment-1890482901 
makes certain realistic queries faster.
   
   
   <details><summary>Details on `bench.sh`</summary>
   <p>
   
   ```shell
   $ ./benchmarks/bench.sh --help
   
   Orchestrates running benchmarks against DataFusion checkouts
   
   Usage:
   ./benchmarks/bench.sh data [benchmark]
   ./benchmarks/bench.sh run [benchmark]
   ./benchmarks/bench.sh compare <branch1> <branch2>
   
   **********
   Examples:
   **********
   # Create the datasets for all benchmarks in /Users/andrewlamb/Software/arrow-datafusion/benchmarks/data
   ./bench.sh data
   
   # Run the 'tpch' benchmark on the datafusion checkout in /source/arrow-datafusion
   DATAFASION_DIR=/source/arrow-datafusion ./bench.sh run tpch
   
   **********
   * Commands
   **********
   data:         Generates data needed for benchmarking
   run:          Runs the named benchmark
   compare:      Comares results from benchmark runs
   
   **********
   * Benchmarks
   **********
   all(default): Data/Run/Compare for all benchmarks
   tpch:                   TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), single parquet file per table
   tpch_mem:               TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), query from memory
   tpch10:                 TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), single parquet file per table
   tpch10_mem:             TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), query from memory
   parquet:                Benchmark of parquet reader's filtering speed
   sort:                   Benchmark of sorting speed
   clickbench_1:           ClickBench queries against a single parquet file
   clickbench_partitioned: ClickBench queries against a partitioned (100 files) parquet
   
   **********
   * Supported Configuration (Environment Variables)
   **********
   DATA_DIR        directory to store datasets
   CARGO_COMMAND   command that runs the benchmark binary
   DATAFASION_DIR  directory to use (default /Users/andrewlamb/Software/arrow-datafusion/benchmarks/..)
   ```
   
   </p>
   </details> 
   
   ### Describe the solution you'd like
   
   I would like to add a new benchmark to `bench.sh` that uses the same dataset 
but has different queries than the existing ClickBench benchmarks:
   
   ```shell
   $ ./benchmarks/bench.sh run clickbench_extended
   ```
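   
   As a rough sketch of how the new benchmark might fit into the existing 
workflow, the helper below only prints the `bench.sh` steps for comparing two 
branches. It executes nothing. `plan_extended_compare` is a hypothetical name, 
and `clickbench_extended` is the benchmark proposed here, not one that 
`bench.sh` currently offers. The sketch also assumes that `data clickbench_1` 
fetches the shared dataset and that `run` benchmarks whichever branch is 
checked out.
   
   ```shell
   # Hypothetical helper: print (do not execute) the steps for comparing
   # two branches on the proposed 'clickbench_extended' benchmark.
   plan_extended_compare() {
     base="$1"
     candidate="$2"
     # Assumption: the new benchmark reuses the existing ClickBench data
     echo "./benchmarks/bench.sh data clickbench_1"
     for branch in "$base" "$candidate"; do
       # Assumption: 'run' benchmarks the currently checked-out branch
       echo "git checkout $branch"
       echo "./benchmarks/bench.sh run clickbench_extended"
     done
     echo "./benchmarks/bench.sh compare $base $candidate"
   }
   
   plan_extended_compare main my-feature-branch
   ```
   
   Printing the plan first makes it easy to review the sequence before 
spending time on a full benchmark run.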
   
   The new queries should
   1. be realistic (one can write an English sentence explaining the quantity 
they compute and how it might be used)
   2. reflect a query pattern that the existing ClickBench queries do not cover
   
   Here is an example from 
https://github.com/apache/arrow-datafusion/pull/8849#issuecomment-1890482901
   
   
   ## Query: Distinct counts
   
   Query Explanation: Data exploration: understand the qualities of the data in 
`hits.parquet`
   Query Properties: multiple count distinct aggregates on string datatypes
   
   ```sql
   SELECT
     COUNT(DISTINCT "SearchPhrase"),
     COUNT(DISTINCT "MobilePhone"),
     COUNT(DISTINCT "MobilePhoneModel")
   FROM 'hits.parquet';
   ```
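   
   One way to try this query locally, as a minimal sketch: save it to a 
scratch file and pass that file to `datafusion-cli -f`. The file name 
`distinct_counts.sql` is hypothetical and is not part of `bench.sh`. The run is 
skipped unless `datafusion-cli` is on the `PATH` and `hits.parquet` is in the 
current directory.
   
   ```shell
   # Hypothetical scratch file; the name is not part of bench.sh
   query_file="${TMPDIR:-/tmp}/distinct_counts.sql"
   
   # Write the query from this issue to the scratch file
   printf '%s\n' \
     'SELECT' \
     '  COUNT(DISTINCT "SearchPhrase"),' \
     '  COUNT(DISTINCT "MobilePhone"),' \
     '  COUNT(DISTINCT "MobilePhoneModel")' \
     "FROM 'hits.parquet';" > "$query_file"
   
   # Run only if the CLI is installed and the data file is present
   if command -v datafusion-cli >/dev/null 2>&1 && [ -f hits.parquet ]; then
     datafusion-cli -f "$query_file"
   else
     echo "skipping run: need datafusion-cli on PATH and hits.parquet in the current directory"
   fi
   ```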
   
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_

