Re: [PR] Add H2O.ai Database-like Ops benchmark to dfbench (groupby support) [datafusion]

via GitHub Sun, 05 Jan 2025 19:54:03 -0800


zhuqi-lucas commented on PR #13996:
URL: https://github.com/apache/datafusion/pull/13996#issuecomment-2572227494


   Also updated; CSV is now supported:
   
   ```shell
   ./benchmarks/bench.sh data h2o_small_csv
   ***************************
   DataFusion Benchmark Runner and Data Generator
   COMMAND: data
   BENCHMARK: h2o_small_csv
   DATA_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/data
   CARGO_COMMAND: cargo run --release
   PREFER_HASH_JOIN: true
   ***************************
   Found Python version 3.13, which is suitable.
   Using Python command: /usr/local/bin/python3
   Installing falsa...
   Generating h2o test data in /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o with size=SMALL and format=CSV
   10000000 rows will be saved into: /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/G1_1e7_1e7_100_0.csv
   
   An output data schema is the following:
   id1: string
   id2: string
   id3: string
   id4: int64
   id5: int64
   id6: int64
   v1: int64 not null
   v2: int64 not null
   v3: double not null
   
   An output format is CSV
   
   Batch mode is supported.
   In case of memory problems you can try to reduce a batch_size.
   
   
   Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:04
   ```
   
   ```shell
   ./benchmarks/bench.sh run h2o_small_csv
   ***************************
   DataFusion Benchmark Script
   COMMAND: run
   BENCHMARK: h2o_small_csv
   DATAFUSION_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/..
   BRANCH_NAME: issue_7209
   DATA_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/data
   RESULTS_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/results/issue_7209
   CARGO_COMMAND: cargo run --release
   PREFER_HASH_JOIN: true
   ***************************
   RESULTS_FILE: /Users/zhuqi/arrow-datafusion/benchmarks/results/issue_7209/h2o.json
   Running h2o benchmark...
       Finished `release` profile [optimized] target(s) in 0.30s
        Running `/Users/zhuqi/arrow-datafusion/target/release/dfbench h2o --iterations 3 --path /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/G1_1e7_1e7_100_0.csv --queries-path /Users/zhuqi/arrow-datafusion/benchmarks/queries/h2o/groupby.sql -o /Users/zhuqi/arrow-datafusion/benchmarks/results/issue_7209/h2o.json`
   Running benchmarks with the following options: RunOpt { query: None, common: CommonOpt { iterations: 3, partitions: None, batch_size: 8192, debug: false }, queries_path: "/Users/zhuqi/arrow-datafusion/benchmarks/queries/h2o/groupby.sql", path: "/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/G1_1e7_1e7_100_0.csv", output_path: Some("/Users/zhuqi/arrow-datafusion/benchmarks/results/issue_7209/h2o.json") }
   Q1: SELECT id1, SUM(v1) AS v1 FROM x GROUP BY id1;
   Query 1 iteration 1 took 131.4 ms and returned 100 rows
   Query 1 iteration 2 took 111.8 ms and returned 100 rows
   Query 1 iteration 3 took 108.0 ms and returned 100 rows
   Q2: SELECT id1, id2, SUM(v1) AS v1 FROM x GROUP BY id1, id2;
   Query 2 iteration 1 took 267.1 ms and returned 6321413 rows
   Query 2 iteration 2 took 240.0 ms and returned 6321413 rows
   Query 2 iteration 3 took 235.2 ms and returned 6321413 rows
   Q3: SELECT id3, SUM(v1) AS v1, AVG(v3) AS v3 FROM x GROUP BY id3;
   Query 3 iteration 1 took 187.3 ms and returned 100000 rows
   Query 3 iteration 2 took 204.2 ms and returned 100000 rows
   Query 3 iteration 3 took 218.2 ms and returned 100000 rows
   Q4: SELECT id4, AVG(v1) AS v1, AVG(v2) AS v2, AVG(v3) AS v3 FROM x GROUP BY id4;
   Query 4 iteration 1 took 145.2 ms and returned 100 rows
   Query 4 iteration 2 took 144.7 ms and returned 100 rows
   Query 4 iteration 3 took 128.9 ms and returned 100 rows
   Q5: SELECT id6, SUM(v1) AS v1, SUM(v2) AS v2, SUM(v3) AS v3 FROM x GROUP BY id6;
   Query 5 iteration 1 took 165.3 ms and returned 100000 rows
   Query 5 iteration 2 took 161.1 ms and returned 100000 rows
   Query 5 iteration 3 took 163.0 ms and returned 100000 rows
   Q6: SELECT id4, id5, MEDIAN(v3) AS median_v3, STDDEV(v3) AS sd_v3 FROM x GROUP BY id4, id5;
   Query 6 iteration 1 took 302.7 ms and returned 10000 rows
   Query 6 iteration 2 took 299.9 ms and returned 10000 rows
   Query 6 iteration 3 took 294.8 ms and returned 10000 rows
   Q7: SELECT id3, MAX(v1) - MIN(v2) AS range_v1_v2 FROM x GROUP BY id3;
   Query 7 iteration 1 took 181.5 ms and returned 100000 rows
   Query 7 iteration 2 took 171.4 ms and returned 100000 rows
   Query 7 iteration 3 took 189.5 ms and returned 100000 rows
   Q8: SELECT id6, largest2_v3 FROM (SELECT id6, v3 AS largest2_v3, ROW_NUMBER() OVER (PARTITION BY id6 ORDER BY v3 DESC) AS order_v3 FROM x WHERE v3 IS NOT NULL) sub_query WHERE order_v3 <= 2;
   Query 8 iteration 1 took 382.6 ms and returned 200000 rows
   Query 8 iteration 2 took 366.2 ms and returned 200000 rows
   Query 8 iteration 3 took 361.9 ms and returned 200000 rows
   Q9: SELECT id2, id4, POWER(CORR(v1, v2), 2) AS r2 FROM x GROUP BY id2, id4;
   Query 9 iteration 1 took 685.0 ms and returned 6320797 rows
   Query 9 iteration 2 took 711.7 ms and returned 6320797 rows
   Query 9 iteration 3 took 725.4 ms and returned 6320797 rows
   Q10: SELECT id1, id2, id3, id4, id5, id6, SUM(v3) AS v3, COUNT(*) AS count FROM x GROUP BY id1, id2, id3, id4, id5, id6;
   Query 10 iteration 1 took 583.5 ms and returned 10000000 rows
   Query 10 iteration 2 took 539.3 ms and returned 10000000 rows
   Query 10 iteration 3 took 560.9 ms and returned 10000000 rows
   Done
   ```
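
For comparing runs, the per-iteration timings printed above can be averaged per query. The following is only a minimal sketch and not part of dfbench or this PR: the helper name `average_ms` and the `sample` text are hypothetical, and it assumes the log lines keep the `Query N iteration M took X ms` shape shown in the output.

```python
import re
from collections import defaultdict
from statistics import mean

# Matches lines such as "Query 1 iteration 1 took 131.4 ms and returned 100 rows".
# The line format is taken from the log above; the helper itself is hypothetical.
LINE_RE = re.compile(r"Query (\d+) iteration \d+ took ([\d.]+) ms")


def average_ms(log_text):
    """Return {query_number: mean elapsed ms} for every query found in the log."""
    timings = defaultdict(list)
    for match in LINE_RE.finditer(log_text):
        timings[int(match.group(1))].append(float(match.group(2)))
    return {query: mean(values) for query, values in timings.items()}


# Sample input copied from the Q1 lines of the run above.
sample = """\
Query 1 iteration 1 took 131.4 ms and returned 100 rows
Query 1 iteration 2 took 111.8 ms and returned 100 rows
Query 1 iteration 3 took 108.0 ms and returned 100 rows
"""

for query, avg in sorted(average_ms(sample).items()):
    print(f"Q{query}: {avg:.1f} ms")
```

With only three iterations per query the mean is a rough indicator; the first iteration often includes warm-up cost, so a minimum or median may be a fairer summary.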
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

