[GitHub] [spark] copperybean opened a new pull request, #42495: [SPARK-44812][SQL] Push filters through join generated from intersect

via GitHub Tue, 15 Aug 2023 00:23:01 -0700


copperybean opened a new pull request, #42495:
URL: https://github.com/apache/spark/pull/42495


   Currently, filters can not be pushed through join generated from intersect 
for two types of SQLs
   1. `select a from (select a from tl intersect select x from tr) where a > 
123 `
   The physical plan is
   ```
   AdaptiveSparkPlan isFinalPlan=false
   +- HashAggregate(keys=[a#8L], functions=[])
      +- Exchange hashpartitioning(a#8L, 200), ENSURE_REQUIREMENTS, 
[plan_id=133]
         +- HashAggregate(keys=[a#8L], functions=[])
            +- BroadcastHashJoin [coalesce(a#8L, 0), isnull(a#8L)], 
[coalesce(x#24L, 0), isnull(x#24L)], LeftSemi, BuildRight, false
               :- Filter (isnotnull(a#8L) AND (a#8L > 123))
               :  +- FileScan json [a#8L] ...
               +- BroadcastExchange 
HashedRelationBroadcastMode(List(coalesce(input[0, bigint, true], 0), 
isnull(input[0, bigint, true])),false), [plan_id=129]
                  +- FileScan json [x#24L] ... 
   ```
   We can find the filter `a > 123` is not pushed to right table.
   
   2. `select a from (select a from tl intersect select x from tr) join trr on 
a = y`
   The physical plan is 
   ```
   *(3) Project [a#8L]
   +- *(3) BroadcastHashJoin [a#8L], [y#114L], Inner, BuildRight, false
      :- *(3) HashAggregate(keys=[a#8L], functions=[])
      :  +- Exchange hashpartitioning(a#8L, 200), ENSURE_REQUIREMENTS, 
[plan_id=506]
      :     +- *(1) HashAggregate(keys=[a#8L], functions=[])
      :        +- *(1) BroadcastHashJoin [coalesce(a#8L, 0), isnull(a#8L)], 
[coalesce(x#24L, 0), isnull(x#24L)], LeftSemi, BuildRight, false
      :           :- *(1) Filter isnotnull(a#8L)
      :           :  +- FileScan json [a#8L] ...
      :           +- BroadcastExchange 
HashedRelationBroadcastMode(List(coalesce(input[0, bigint, true], 0), 
isnull(input[0, bigint, true])),false), [plan_id=490]
      :              +- FileScan json [x#24L] ...
      +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
false]),false), [plan_id=512]
         +- *(2) Filter isnotnull(y#114L)
            +- FileScan json [y#114L] ...
   ```
   There should be a filter `isnotnull( x )` for table tr, while it's not 
pushed down.
   
   ### What changes were proposed in this pull request?
   For 1st SQL, the filter can not be pushed down through `EqualNullSafe` in 
optimizer rule `InferFiltersFromConstraints`. This can be fixed in method 
`ConstraintHelper.inferAdditionalConstraints`.
   For 2nd SQL, after rules `InferFiltersFromConstraints` and 
`PushDownPredicates`, the filter can only be pushed to left side of join, so 
another `InferFiltersFromConstraints` is needed to push the filter to right 
side of join.
   
   ### Why are the changes needed?
   Huge amount of data may be filtered before expensive join operation.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Unit test
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] copperybean opened a new pull request, #42495: [SPARK-44812][SQL] Push filters through join generated from intersect

Reply via email to