2010YOUY01 opened a new pull request, #16819:
URL: https://github.com/apache/datafusion/pull/16819
## Which issue does this PR close?
- NA
## Rationale for this change
The nested loop join (NLJ) operator still has room to improve in both
performance and memory consumption, and it has recently attracted interest
from the community (cc @jonathanc-n).
Inspired by the benchmarks used by @UBarney in
https://github.com/apache/datafusion/pull/16443#issuecomment-2986400295, this
PR adds a similar micro-benchmark for NLJ to the DataFusion benchmark suite.
## What changes are included in this PR?
This PR adds a new NLJ micro-benchmark to the benchmark suite (`./bench.sh ...`).
The queries, and the characteristics each one varies, can be found in the source.
The special join types (semi/anti/mark) are not included because I'm not sure
what a typical workload for them looks like.
The bench runner includes a validation step to ensure that each query uses NLJ
in its physical plan.
Also, the optimizer currently does not reorder joins, so the execution order
follows the join order in the SQL string. (I wish there were an option to
explicitly enforce this behavior.)
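To illustrate the kind of validation step described above, here is a minimal sketch. It assumes the physical plan is available as text (for example from `EXPLAIN`) and that the operator is printed as `NestedLoopJoinExec`. The helper names `plan_uses_nlj` and `validate_query_plan` are hypothetical, and the actual check in this PR may be implemented differently.

```rust
/// Returns true if the textual physical plan contains a nested loop join.
/// Assumption: the plan is rendered one operator per line and the operator
/// is printed under the name `NestedLoopJoinExec`.
fn plan_uses_nlj(plan_text: &str) -> bool {
    plan_text
        .lines()
        .any(|line| line.trim_start().starts_with("NestedLoopJoinExec"))
}

/// Hypothetical helper: fail fast if a benchmark query would not exercise NLJ.
fn validate_query_plan(query_name: &str, plan_text: &str) -> Result<(), String> {
    if plan_uses_nlj(plan_text) {
        Ok(())
    } else {
        Err(format!("query {query_name} does not use NestedLoopJoinExec"))
    }
}

fn main() {
    // Illustrative plan snippets, not real output from the benchmark.
    let nlj_plan = "CoalesceBatchesExec: target_batch_size=8192\n  NestedLoopJoinExec: join_type=Inner\n";
    let hash_plan = "HashJoinExec: mode=Partitioned, join_type=Inner\n";

    assert!(validate_query_plan("q1", nlj_plan).is_ok());
    assert!(validate_query_plan("q1", hash_plan).is_err());
    println!("plan checks passed");
}
```

Failing before any timing starts keeps a planner change that silently switches a query to another join type from producing misleading NLJ numbers.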
## Are these changes tested?
I tested it locally:
<details>
<summary> Bench Run </summary>
```sh
yongting@Yongtings-MacBook-Pro-2 ~/C/d/benchmarks (nlj-bench *)> ./bench.sh data nlj
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: nlj
DATA_DIR: /Users/yongting/Code/datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
NLJ benchmark does not require data generation
yongting@Yongtings-MacBook-Pro-2 ~/C/d/benchmarks (nlj-bench *)> ./bench.sh run nlj
***************************
DataFusion Benchmark Script
COMMAND: run
BENCHMARK: nlj
QUERY: All
DATAFUSION_DIR: /Users/yongting/Code/datafusion/benchmarks/..
BRANCH_NAME: nlj-bench
DATA_DIR: /Users/yongting/Code/datafusion/benchmarks/data
RESULTS_DIR: /Users/yongting/Code/datafusion/benchmarks/results/nlj-bench
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
RESULTS_FILE: /Users/yongting/Code/datafusion/benchmarks/results/nlj-bench/nlj.json
Running nlj benchmark...
+ cargo run --release --bin dfbench -- nlj --iterations 5 -o /Users/yongting/Code/datafusion/benchmarks/results/nlj-bench/nlj.json
Compiling ...
Running NLJ benchmarks with the following options: RunOpt {
    query_name: None,
    common: CommonOpt {
        iterations: 5,
        partitions: None,
        batch_size: None,
        mem_pool_type: "fair",
        memory_limit: None,
        sort_spill_reservation_bytes: None,
        debug: false,
    },
    output_path: Some(
        "/Users/yongting/Code/datafusion/benchmarks/results/nlj-bench/nlj.json",
    ),
}
Query q1 iteration 0 returned 100000 rows in 287.247375ms
Query q1 iteration 1 returned 100000 rows in 285.833ms
Query q1 iteration 2 returned 100000 rows in 245.063084ms
Query q1 iteration 3 returned 100000 rows in 206.90325ms
Query q1 iteration 4 returned 100000 rows in 207.072917ms
Query q2 iteration 0 returned 20000000 rows in 254.630083ms
Query q2 iteration 1 returned 20000000 rows in 246.942708ms
Query q2 iteration 2 returned 20000000 rows in 239.448709ms
Query q2 iteration 3 returned 20000000 rows in 240.270583ms
Query q2 iteration 4 returned 20000000 rows in 251.336291ms
Query q3 iteration 0 returned 90000000 rows in 446.120291ms
Query q3 iteration 1 returned 90000000 rows in 453.314375ms
Query q3 iteration 2 returned 90000000 rows in 358.530208ms
Query q3 iteration 3 returned 90000000 rows in 394.261916ms
Query q3 iteration 4 returned 90000000 rows in 453.936083ms
Query q4 iteration 0 returned 180000000 rows in 1.118616083s
Query q4 iteration 1 returned 180000000 rows in 1.037793375s
Query q4 iteration 2 returned 180000000 rows in 952.131541ms
Query q4 iteration 3 returned 180000000 rows in 962.842834ms
Query q4 iteration 4 returned 180000000 rows in 1.056383333s
Query q5 iteration 0 returned 2000000 rows in 572.229083ms
Query q5 iteration 1 returned 2000000 rows in 611.111917ms
Query q5 iteration 2 returned 2000000 rows in 836.5735ms
Query q5 iteration 3 returned 2000000 rows in 622.4575ms
Query q5 iteration 4 returned 2000000 rows in 579.447708ms
Query q6 iteration 0 returned 2000000 rows in 9.371356959s
Query q6 iteration 1 returned 2000000 rows in 6.032997291s
Query q6 iteration 2 returned 2000000 rows in 5.728677125s
Query q6 iteration 3 returned 2000000 rows in 6.046709958s
Query q6 iteration 4 returned 2000000 rows in 5.766419917s
Query q7 iteration 0 returned 2000000 rows in 790.340125ms
Query q7 iteration 1 returned 2000000 rows in 654.001709ms
Query q7 iteration 2 returned 2000000 rows in 860.251ms
Query q7 iteration 3 returned 2000000 rows in 531.644959ms
Query q7 iteration 4 returned 2000000 rows in 525.802541ms
Query q8 iteration 0 returned 2000000 rows in 9.162710916s
Query q8 iteration 1 returned 2000000 rows in 5.64653225s
Query q8 iteration 2 returned 2000000 rows in 5.505889417s
Query q8 iteration 3 returned 2000000 rows in 5.58156175s
Query q8 iteration 4 returned 2000000 rows in 5.635720625s
Query q9 iteration 0 returned 900000 rows in 875.642083ms
Query q9 iteration 1 returned 900000 rows in 655.309166ms
Query q9 iteration 2 returned 900000 rows in 653.490167ms
Query q9 iteration 3 returned 900000 rows in 655.535958ms
Query q9 iteration 4 returned 900000 rows in 655.982292ms
Query q10 iteration 0 returned 810000000 rows in 2.26567725s
Query q10 iteration 1 returned 810000000 rows in 2.690937042s
Query q10 iteration 2 returned 810000000 rows in 3.48998175s
Query q10 iteration 3 returned 810000000 rows in 3.145351041s
Query q10 iteration 4 returned 810000000 rows in 5.294884292s
+ set +x
Done
yongting@Yongtings-MacBook-Pro-2 ~/C/d/benchmarks (nlj-bench *)> ./bench.sh compare nlj-bench nlj-bench
Comparing nlj-bench and nlj-bench
--------------------
--------------------
Benchmark nlj.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query ┃ nlj-bench ┃ nlj-bench ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery q1 │ 206.90 ms │ 206.90 ms │ no change │
│ QQuery q2 │ 239.45 ms │ 239.45 ms │ no change │
│ QQuery q3 │ 358.53 ms │ 358.53 ms │ no change │
│ QQuery q4 │ 952.13 ms │ 952.13 ms │ no change │
│ QQuery q5 │ 572.23 ms │ 572.23 ms │ no change │
│ QQuery q6 │ 5728.68 ms │ 5728.68 ms │ no change │
│ QQuery q7 │ 525.80 ms │ 525.80 ms │ no change │
│ QQuery q8 │ 5505.89 ms │ 5505.89 ms │ no change │
│ QQuery q9 │ 653.49 ms │ 653.49 ms │ no change │
│ QQuery q10 │ 2265.68 ms │ 2265.68 ms │ no change │
└──────────────┴────────────┴────────────┴───────────┘
```
</details>
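In the run above, every value in the comparison table matches the fastest of the five iterations for that query (for example, q1 reports 206.90 ms and its fastest iteration took 206.90325ms). The sketch below reproduces that arithmetic from the log lines. It assumes the log format shown above, and that taking the minimum is the intended aggregation, which is inferred from this output rather than confirmed. The function names are hypothetical.

```rust
use std::collections::BTreeMap;

/// Parse a duration such as "287.247375ms" or "1.118616083s" into milliseconds.
fn parse_millis(text: &str) -> Option<f64> {
    if let Some(ms) = text.strip_suffix("ms") {
        ms.parse().ok()
    } else if let Some(secs) = text.strip_suffix('s') {
        secs.parse::<f64>().ok().map(|v| v * 1000.0)
    } else {
        None
    }
}

/// Parse a line like
/// "Query q1 iteration 0 returned 100000 rows in 287.247375ms"
/// into (query name, elapsed milliseconds). Other lines yield None.
fn parse_line(line: &str) -> Option<(String, f64)> {
    let words: Vec<&str> = line.split_whitespace().collect();
    match words.as_slice() {
        ["Query", name, "iteration", _, "returned", _, "rows", "in", elapsed] => {
            Some((name.to_string(), parse_millis(elapsed)?))
        }
        _ => None,
    }
}

/// Keep the fastest iteration (in milliseconds) for each query in the log.
fn fastest_per_query(log: &str) -> BTreeMap<String, f64> {
    let mut best = BTreeMap::new();
    for (name, ms) in log.lines().filter_map(parse_line) {
        best.entry(name)
            .and_modify(|m: &mut f64| *m = m.min(ms))
            .or_insert(ms);
    }
    best
}

fn main() {
    // Lines copied from the q4 portion of the run above.
    let log = "\
Query q4 iteration 0 returned 180000000 rows in 1.118616083s
Query q4 iteration 1 returned 180000000 rows in 1.037793375s
Query q4 iteration 2 returned 180000000 rows in 952.131541ms
Query q4 iteration 3 returned 180000000 rows in 962.842834ms
Query q4 iteration 4 returned 180000000 rows in 1.056383333s
";
    for (name, ms) in fastest_per_query(log) {
        println!("{name}: {ms:.2} ms");
    }
}
```

Reporting the minimum rather than the mean reduces the influence of warm-up and noisy iterations, such as the slow first iterations of q6 and q8.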
## Are there any user-facing changes?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]