zhuqi-lucas commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082712617
Updated Parquet results from my local machine using the 1e8 dataset; it is even faster:
```text
./bench.sh run h2o_medium_join_parquet
***************************
DataFusion Benchmark Script
COMMAND: run
BENCHMARK: h2o_medium_join_parquet
QUERY: All
DATAFUSION_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/..
BRANCH_NAME: support_parquet_for_h2o
DATA_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/data
RESULTS_DIR:
/Users/zhuqi/arrow-datafusion/benchmarks/results/support_parquet_for_h2o
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
RESULTS_FILE:
/Users/zhuqi/arrow-datafusion/benchmarks/results/support_parquet_for_h2o/h2o_join.json
Running h2o join benchmark...
+ cargo run --release --bin dfbench -- h2o --iterations 3 --join-paths
/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_NA_0.parquet,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e2_0.parquet,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e5_0.parquet,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e8_NA.parquet
--queries-path /Users/zhuqi/arrow-datafusion/benchmarks/queries/h2o/join.sql
-o
/Users/zhuqi/arrow-datafusion/benchmarks/results/support_parquet_for_h2o/h2o_join.json
Finished `release` profile [optimized] target(s) in 0.34s
Running
`/Users/zhuqi/arrow-datafusion/target/aarch64-apple-darwin/release/dfbench h2o
--iterations 3 --join-paths
/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_NA_0.parquet,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e2_0.parquet,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e5_0.parquet,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e8_NA.parquet
--queries-path /Users/zhuqi/arrow-datafusion/benchmarks/queries/h2o/join.sql
-o
/Users/zhuqi/arrow-datafusion/benchmarks/results/support_parquet_for_h2o/h2o_join.json`
Running benchmarks with the following options: RunOpt { query: None, common:
CommonOpt { iterations: 3, partitions: None, batch_size: None, mem_pool_type:
"fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false },
queries_path: "/Users/zhuqi/arrow-datafusion/benchmarks/queries/h2o/join.sql",
path: "benchmarks/data/h2o/G1_1e7_1e7_100_0.csv", join_paths:
"/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_NA_0.parquet,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e2_0.parquet,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e5_0.parquet,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e8_NA.parquet",
output_path:
Some("/Users/zhuqi/arrow-datafusion/benchmarks/results/support_parquet_for_h2o/h2o_join.json")
}
Q1: SELECT x.id1, x.id2, x.id3, x.id4 as xid4, small.id4 as smallid4, x.id5,
x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1;
Query 1 iteration 1 took 6.2 ms and returned 90 rows
Query 1 iteration 2 took 0.7 ms and returned 90 rows
Query 1 iteration 3 took 0.6 ms and returned 90 rows
Query 1 avg time: 2.51 ms
Q2: SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as
xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6,
x.v1, medium.v2 FROM x INNER JOIN medium ON x.id2 = medium.id2;
Query 2 iteration 1 took 5.4 ms and returned 89 rows
Query 2 iteration 2 took 4.1 ms and returned 89 rows
Query 2 iteration 3 took 4.4 ms and returned 89 rows
Query 2 avg time: 4.64 ms
Q3: SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as
xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6,
x.v1, medium.v2 FROM x LEFT JOIN medium ON x.id2 = medium.id2;
Query 3 iteration 1 took 4.1 ms and returned 100 rows
Query 3 iteration 2 took 3.8 ms and returned 100 rows
Query 3 iteration 3 took 4.2 ms and returned 100 rows
Query 3 avg time: 4.02 ms
Q4: SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as
xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6,
x.v1, medium.v2 FROM x JOIN medium ON x.id5 = medium.id5;
Query 4 iteration 1 took 3.0 ms and returned 89 rows
Query 4 iteration 2 took 2.9 ms and returned 89 rows
Query 4 iteration 3 took 2.8 ms and returned 89 rows
Query 4 avg time: 2.90 ms
Q5: SELECT x.id1 as xid1, large.id1 as largeid1, x.id2 as xid2, large.id2 as
largeid2, x.id3, x.id4 as xid4, large.id4 as largeid4, x.id5 as xid5, large.id5
as largeid5, x.id6 as xid6, large.id6 as largeid6, x.v1, large.v2 FROM x JOIN
large ON x.id3 = large.id3;
Query 5 iteration 1 took 468.4 ms and returned 92 rows
Query 5 iteration 2 took 464.7 ms and returned 92 rows
Query 5 iteration 3 took 449.2 ms and returned 92 rows
Query 5 avg time: 460.75 ms
+ set +x
Done
```
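To compare runs like this one, the per-iteration lines can be re-aggregated with a small script. The following is only a minimal sketch, not part of `bench.sh` or `dfbench`: the `summarize` helper and the regex are assumptions based on the line format printed above. Separating out a "warm" mean (skipping the first iteration) is my own choice, made because the first iteration of Q1 is much slower than the other two.

```python
import re
from collections import defaultdict

# Matches lines such as "Query 5 iteration 1 took 468.4 ms and returned 92 rows".
# The pattern is an assumption derived from the output shown above.
ITERATION_RE = re.compile(
    r"Query (\d+) iteration (\d+) took ([\d.]+) ms and returned (\d+) rows"
)


def summarize(log_text):
    """Return {query_id: (mean_ms, warm_mean_ms)} for dfbench-style output.

    warm_mean_ms skips the first iteration of each query when more than
    one iteration is present (hypothetical helper, not a DataFusion API).
    """
    timings = defaultdict(list)
    for match in ITERATION_RE.finditer(log_text):
        timings[int(match.group(1))].append(float(match.group(3)))

    summary = {}
    for query_id, times in sorted(timings.items()):
        mean_ms = sum(times) / len(times)
        warm = times[1:] or times  # fall back to all runs if only one exists
        summary[query_id] = (mean_ms, sum(warm) / len(warm))
    return summary


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
Query 1 iteration 1 took 6.2 ms and returned 90 rows
Query 1 iteration 2 took 0.7 ms and returned 90 rows
Query 1 iteration 3 took 0.6 ms and returned 90 rows
Query 5 iteration 1 took 468.4 ms and returned 92 rows
Query 5 iteration 2 took 464.7 ms and returned 92 rows
Query 5 iteration 3 took 449.2 ms and returned 92 rows
"""

for query_id, (mean_ms, warm_ms) in summarize(SAMPLE).items():
    print(f"Q{query_id}: mean {mean_ms:.2f} ms, warm mean {warm_ms:.2f} ms")
```

Means recomputed this way can differ slightly from the reported `avg time` values, presumably because the per-iteration times in the log are rounded to one decimal place.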
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]