copperybean opened a new pull request, #42495:
URL: https://github.com/apache/spark/pull/42495
Currently, filters can not be pushed through join generated from intersect
for two types of SQLs
1. `select a from (select a from tl intersect select x from tr) where a >
123 `
The physical plan is
```
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[a#8L], functions=[])
+- Exchange hashpartitioning(a#8L, 200), ENSURE_REQUIREMENTS,
[plan_id=133]
+- HashAggregate(keys=[a#8L], functions=[])
+- BroadcastHashJoin [coalesce(a#8L, 0), isnull(a#8L)],
[coalesce(x#24L, 0), isnull(x#24L)], LeftSemi, BuildRight, false
:- Filter (isnotnull(a#8L) AND (a#8L > 123))
: +- FileScan json [a#8L] ...
+- BroadcastExchange
HashedRelationBroadcastMode(List(coalesce(input[0, bigint, true], 0),
isnull(input[0, bigint, true])),false), [plan_id=129]
+- FileScan json [x#24L] ...
```
We can find the filter `a > 123` is not pushed to right table.
2. `select a from (select a from tl intersect select x from tr) join trr on
a = y`
The physical plan is
```
*(3) Project [a#8L]
+- *(3) BroadcastHashJoin [a#8L], [y#114L], Inner, BuildRight, false
:- *(3) HashAggregate(keys=[a#8L], functions=[])
: +- Exchange hashpartitioning(a#8L, 200), ENSURE_REQUIREMENTS,
[plan_id=506]
: +- *(1) HashAggregate(keys=[a#8L], functions=[])
: +- *(1) BroadcastHashJoin [coalesce(a#8L, 0), isnull(a#8L)],
[coalesce(x#24L, 0), isnull(x#24L)], LeftSemi, BuildRight, false
: :- *(1) Filter isnotnull(a#8L)
: : +- FileScan json [a#8L] ...
: +- BroadcastExchange
HashedRelationBroadcastMode(List(coalesce(input[0, bigint, true], 0),
isnull(input[0, bigint, true])),false), [plan_id=490]
: +- FileScan json [x#24L] ...
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint,
false]),false), [plan_id=512]
+- *(2) Filter isnotnull(y#114L)
+- FileScan json [y#114L] ...
```
There should be a filter `isnotnull( x )` for table tr, while it's not
pushed down.
### What changes were proposed in this pull request?
For 1st SQL, the filter can not be pushed down through `EqualNullSafe` in
optimizer rule `InferFiltersFromConstraints`. This can be fixed in method
`ConstraintHelper.inferAdditionalConstraints`.
For 2nd SQL, after rules `InferFiltersFromConstraints` and
`PushDownPredicates`, the filter can only be pushed to left side of join, so
another `InferFiltersFromConstraints` is needed to push the filter to right
side of join.
### Why are the changes needed?
Huge amount of data may be filtered before expensive join operation.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]