zhuqi-lucas commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082712617
Updated Parquet results from my local run using the 1e8 dataset; it is even faster:

```text
./bench.sh run h2o_medium_join_parquet
***************************
DataFusion Benchmark Script
COMMAND: run
BENCHMARK: h2o_medium_join_parquet
QUERY: All
DATAFUSION_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/..
BRANCH_NAME: support_parquet_for_h2o
DATA_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/data
RESULTS_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/results/support_parquet_for_h2o
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
RESULTS_FILE: /Users/zhuqi/arrow-datafusion/benchmarks/results/support_parquet_for_h2o/h2o_join.json
Running h2o join benchmark...
+ cargo run --release --bin dfbench -- h2o --iterations 3 --join-paths /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_NA_0.parquet,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e2_0.parquet,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e5_0.parquet,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e8_NA.parquet --queries-path /Users/zhuqi/arrow-datafusion/benchmarks/queries/h2o/join.sql -o /Users/zhuqi/arrow-datafusion/benchmarks/results/support_parquet_for_h2o/h2o_join.json
    Finished `release` profile [optimized] target(s) in 0.34s
     Running `/Users/zhuqi/arrow-datafusion/target/aarch64-apple-darwin/release/dfbench h2o --iterations 3 --join-paths /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_NA_0.parquet,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e2_0.parquet,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e5_0.parquet,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e8_NA.parquet --queries-path /Users/zhuqi/arrow-datafusion/benchmarks/queries/h2o/join.sql -o /Users/zhuqi/arrow-datafusion/benchmarks/results/support_parquet_for_h2o/h2o_join.json`
Running benchmarks with the following options: RunOpt { query: None, common: CommonOpt { iterations: 3, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, queries_path: "/Users/zhuqi/arrow-datafusion/benchmarks/queries/h2o/join.sql", path: "benchmarks/data/h2o/G1_1e7_1e7_100_0.csv", join_paths: "/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_NA_0.parquet,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e2_0.parquet,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e5_0.parquet,/Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e8_1e8_NA.parquet", output_path: Some("/Users/zhuqi/arrow-datafusion/benchmarks/results/support_parquet_for_h2o/h2o_join.json") }
Q1: SELECT x.id1, x.id2, x.id3, x.id4 as xid4, small.id4 as smallid4, x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1;
Query 1 iteration 1 took 6.2 ms and returned 90 rows
Query 1 iteration 2 took 0.7 ms and returned 90 rows
Query 1 iteration 3 took 0.6 ms and returned 90 rows
Query 1 avg time: 2.51 ms
Q2: SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, x.v1, medium.v2 FROM x INNER JOIN medium ON x.id2 = medium.id2;
Query 2 iteration 1 took 5.4 ms and returned 89 rows
Query 2 iteration 2 took 4.1 ms and returned 89 rows
Query 2 iteration 3 took 4.4 ms and returned 89 rows
Query 2 avg time: 4.64 ms
Q3: SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, x.v1, medium.v2 FROM x LEFT JOIN medium ON x.id2 = medium.id2;
Query 3 iteration 1 took 4.1 ms and returned 100 rows
Query 3 iteration 2 took 3.8 ms and returned 100 rows
Query 3 iteration 3 took 4.2 ms and returned 100 rows
Query 3 avg time: 4.02 ms
Q4: SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, x.v1, medium.v2 FROM x JOIN medium ON x.id5 = medium.id5;
Query 4 iteration 1 took 3.0 ms and returned 89 rows
Query 4 iteration 2 took 2.9 ms and returned 89 rows
Query 4 iteration 3 took 2.8 ms and returned 89 rows
Query 4 avg time: 2.90 ms
Q5: SELECT x.id1 as xid1, large.id1 as largeid1, x.id2 as xid2, large.id2 as largeid2, x.id3, x.id4 as xid4, large.id4 as largeid4, x.id5 as xid5, large.id5 as largeid5, x.id6 as xid6, large.id6 as largeid6, x.v1, large.v2 FROM x JOIN large ON x.id3 = large.id3;
Query 5 iteration 1 took 468.4 ms and returned 92 rows
Query 5 iteration 2 took 464.7 ms and returned 92 rows
Query 5 iteration 3 took 449.2 ms and returned 92 rows
Query 5 avg time: 460.75 ms
+ set +x
Done
```
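For anyone comparing runs: in the log above, the first iteration of Q1 (6.2 ms) is much slower than the next two (0.7 ms and 0.6 ms), so the reported average is dominated by that first iteration. Below is a rough sketch, not part of `dfbench`, that re-reads the `Query N iteration M took X ms` lines and prints both the mean and the minimum per query. The helper names `parse_iteration` and `summarize` are made up for illustration, and the line format is assumed only from the output shown here. Means recomputed from the rounded log values can differ slightly from the `avg time` lines that the benchmark prints.

```rust
use std::collections::BTreeMap;

/// Parse a line such as "Query 1 iteration 2 took 0.7 ms and returned 90 rows"
/// into (query number, elapsed milliseconds). Other lines yield None.
/// Hypothetical helper; the format is assumed from the log in this comment.
fn parse_iteration(line: &str) -> Option<(u32, f64)> {
    let mut words = line.split_whitespace();
    if words.next()? != "Query" {
        return None;
    }
    let query: u32 = words.next()?.parse().ok()?;
    if words.next()? != "iteration" {
        return None;
    }
    words.next()?; // iteration index, not needed here
    if words.next()? != "took" {
        return None;
    }
    let ms: f64 = words.next()?.parse().ok()?;
    Some((query, ms))
}

/// Return (mean, min) of the timings, or None for an empty slice.
fn summarize(times: &[f64]) -> Option<(f64, f64)> {
    if times.is_empty() {
        return None;
    }
    let mean = times.iter().sum::<f64>() / times.len() as f64;
    let min = times.iter().copied().fold(f64::INFINITY, f64::min);
    Some((mean, min))
}

fn main() {
    // A few lines copied from the log above.
    let log = "\
Query 1 iteration 1 took 6.2 ms and returned 90 rows
Query 1 iteration 2 took 0.7 ms and returned 90 rows
Query 1 iteration 3 took 0.6 ms and returned 90 rows
Query 1 avg time: 2.51 ms
Query 5 iteration 1 took 468.4 ms and returned 92 rows
Query 5 iteration 2 took 464.7 ms and returned 92 rows
Query 5 iteration 3 took 449.2 ms and returned 92 rows
Query 5 avg time: 460.75 ms";

    // Group timings by query number; "avg time" lines are skipped by the parser.
    let mut by_query: BTreeMap<u32, Vec<f64>> = BTreeMap::new();
    for line in log.lines() {
        if let Some((query, ms)) = parse_iteration(line.trim()) {
            by_query.entry(query).or_default().push(ms);
        }
    }

    for (query, times) in &by_query {
        if let Some((mean, min)) = summarize(times) {
            println!(
                "Q{query}: mean {mean:.2} ms, min {min:.2} ms over {} iterations",
                times.len()
            );
        }
    }
}
```

Reporting the minimum next to the mean makes it easier to see how much of each average comes from a slow first iteration.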