GitHub user dilipbiswal opened a pull request:
https://github.com/apache/spark/pull/23211
[SPARK-19712][SQL] Move PullupCorrelatedPredicates and
RewritePredicateSubquery after OptimizeSubqueries
Currently predicate subqueries (IN/EXISTS) are converted to Joins at the
end of optimizer in RewritePredicateSubquery. This change moves the rewrite
close to beginning of optimizer. The original idea was to keep the subquery
expressions in Filter form so that we can push them down as deep as possible.
One disadvantage is that, after the subqueries are rewritten in join form, they
are not subjected to further optimizations. In this change, we convert the
subqueries to join form early in the rewrite phase and then add logic to push
the left-semi and left-anti joins down like we do for normal filter ops. I can
think of the following advantages :
1. We will produce consistent optimized plans for subqueries written using
SQL dialect and data frame apis.
2. Will hopefully make it easier to do the next phase of de-correlations
when we opens up more cases of de-correlation. In this case, it would be
beneficial to expose the rewritten queries to all the other optimization rules.
3. We can now hopefully get-rid of PullupCorrelatedPredicates rule and
combine ths with RewritePredicateSubquery. I haven't tried it. Will take it on
a followup.
(P.S Thanks to Natt for his original work in
[here](https://github.com/apache/spark/pull/17520). I have based this pr on his
work)
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise,
remove this)
Please review http://spark.apache.org/contributing.html before opening a
pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dilipbiswal/spark SPARK-19712-NEW
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/23211.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #23211
----
commit f4bb126472eb5a808a3ae94bcfb59e0674e01217
Author: Dilip Biswal <dbiswal@...>
Date: 2018-12-03T22:06:24Z
[SPARK-19712] Move PullupCorrelatedPredicates and RewritePredicateSubquery
after OptimizeSubqueries
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]