[GitHub] spark pull request #23211: [SPARK-19712][SQL] Move PullupCorrelatedPredicate...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23211#discussion_r240097479

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -984,6 +1002,28 @@ object PushDownPredicate extends Rule[LogicalPlan] with PredicateHelper {
               project.copy(child = Filter(replaceAlias(condition, aliasMap), grandChild))

    +        // Similar to the above Filter over Project:
    +        // LeftSemi/LeftAnti over Project
    +        case join @ Join(p @ Project(pList, grandChild), rightOp, LeftSemiOrAnti(joinType), joinCond)
    --- End diff --

    Shall we create a new rule `PushdownLeftSemiOrAntiJoin`?

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
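The dedicated pushdown rule the reviewer suggests could be sketched on a toy plan representation. This is an illustration only: the case classes below are simplified stand-ins, not Spark's actual Catalyst nodes, and the sketch ignores aliasing in the project list (the real rule must rewrite the join condition through aliases).

```scala
// Toy plan nodes standing in for Catalyst's LogicalPlan hierarchy.
sealed trait Plan
case class Relation(name: String, output: Seq[String]) extends Plan
case class Project(projectList: Seq[String], child: Plan) extends Plan
case class Join(left: Plan, right: Plan, joinType: String, cond: Seq[String]) extends Plan

// Push a LeftSemi/LeftAnti join below the Project when every attribute
// the join condition references is passed through unchanged, so the
// condition can be evaluated against the grandchild directly.
def pushDownLeftSemiOrAntiJoin(plan: Plan): Plan = plan match {
  case Join(Project(pList, grandChild), right, jt @ ("LeftSemi" | "LeftAnti"), cond)
      if cond.forall(pList.contains) =>
    Project(pList, Join(grandChild, right, jt, cond))
  case other => other
}
```

The semi/anti join only filters rows of its left side without adding columns, which is why it can slide below the Project exactly like a Filter.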
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23211#discussion_r240097255

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -649,13 +664,16 @@ object CollapseProject extends Rule[LogicalPlan] {
       def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
         case p1 @ Project(_, p2: Project) =>
    -      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList)) {
    +      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList) ||
    +          ScalarSubquery.hasScalarSubquery(p1.projectList) ||
    +          ScalarSubquery.hasScalarSubquery(p2.projectList)) {
    --- End diff --

    Why did we allow it before?

---
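One reason to block collapsing here: inlining a lower alias that holds a scalar subquery into an upper projection that references it more than once duplicates the subquery, which would then be planned and executed twice. A self-contained toy illustration (simplified stand-ins, not Catalyst's actual expression classes):

```scala
// Toy expression tree; ScalarSubquery stands in for an expensive subplan.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Add(l: Expr, r: Expr) extends Expr
case class ScalarSubquery(sql: String) extends Expr

// What collapsing two Projects effectively does: substitute the lower
// project's aliases into the upper project's expressions.
def inlineAliases(e: Expr, aliases: Map[String, Expr]): Expr = e match {
  case Attr(n)   => aliases.getOrElse(n, e)
  case Add(l, r) => Add(inlineAliases(l, aliases), inlineAliases(r, aliases))
  case other     => other
}

def countSubqueries(e: Expr): Int = e match {
  case _: ScalarSubquery => 1
  case Add(l, r)         => countSubqueries(l) + countSubqueries(r)
  case _                 => 0
}
```

With `x` aliased to a scalar subquery, inlining it into `x + x` yields an expression containing the subquery twice, mirroring why the non-deterministic check above exists for the same structural reason.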
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23211#discussion_r240092936

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
    @@ -267,6 +267,17 @@ object ScalarSubquery {
           case _ => false
         }.isDefined
       }
    +
    +  def hasScalarSubquery(e: Expression): Boolean = {
    +    e.find {
    +      case s: ScalarSubquery => true
    +      case _ => false
    +    }.isDefined
    +  }
    +
    +  def hasScalarSubquery(e: Seq[Expression]): Boolean = {
    +    e.find(hasScalarSubquery(_)).isDefined
    --- End diff --

    `e.exists(hasScalarSubquery)`

---
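The suggested simplification is behavior-preserving: on any `Seq`, `exists(p)` and `find(p).isDefined` both short-circuit at the first match. A quick self-contained check with hypothetical helper names (generic stand-ins for the `hasScalarSubquery` overload in the diff):

```scala
// Verbose spelling, as written in the diff under review.
def hasMatchVerbose[A](xs: Seq[A])(p: A => Boolean): Boolean =
  xs.find(p).isDefined

// Idiomatic spelling, as suggested by the reviewer. Note that a
// function passed where a Function1 is expected needs no `(_)`.
def hasMatchIdiomatic[A](xs: Seq[A])(p: A => Boolean): Boolean =
  xs.exists(p)
```

`exists` states the intent directly (a boolean question) instead of materializing an `Option` only to test it.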
GitHub user dilipbiswal opened a pull request:

    https://github.com/apache/spark/pull/23211

    [SPARK-19712][SQL] Move PullupCorrelatedPredicates and RewritePredicateSubquery after OptimizeSubqueries

    Currently, predicate subqueries (IN/EXISTS) are converted to joins at the end of the optimizer, in RewritePredicateSubquery. This change moves the rewrite close to the beginning of the optimizer. The original idea was to keep the subquery expressions in Filter form so that we can push them down as deep as possible. One disadvantage is that, once the subqueries are rewritten in join form, they are not subject to further optimizations. In this change, we convert the subqueries to join form early in the rewrite phase and then add logic to push the resulting left-semi and left-anti joins down, as we do for normal filter operations. I can think of the following advantages:

    1. We will produce consistent optimized plans for subqueries written using the SQL dialect and the DataFrame API.
    2. It will hopefully make the next phase of de-correlation easier, when we open up more cases of de-correlation. In that case, it would be beneficial to expose the rewritten queries to all the other optimization rules.
    3. We can now hopefully get rid of the PullupCorrelatedPredicates rule and combine it with RewritePredicateSubquery. I haven't tried that; I will take it on in a follow-up.

    (P.S. Thanks to Natt for his original work [here](https://github.com/apache/spark/pull/17520). I have based this PR on his work.)
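The rewrite this PR moves earlier turns an IN-subquery predicate into a left-semi join, after which the join can be pushed down like an ordinary filter. A toy sketch of that shape (simplified stand-ins, not the real RewritePredicateSubquery rule, and ignoring correlation handling):

```scala
// Toy plan nodes; FilterIn models Filter(a IN (subquery)) over child.
sealed trait Plan
case class Relation(name: String) extends Plan
case class FilterIn(attr: String, subquery: Plan, child: Plan) extends Plan
case class Join(left: Plan, right: Plan, joinType: String, on: String) extends Plan

// a IN (subquery) keeps exactly the child rows with a match in the
// subquery's output -- the definition of a left-semi join. EXISTS is
// analogous; NOT IN / NOT EXISTS map to left-anti joins.
def rewritePredicateSubquery(plan: Plan): Plan = plan match {
  case FilterIn(a, sub, child) => Join(child, sub, "LeftSemi", a)
  case other => other
}
```

Doing this rewrite early is what lets the semi/anti joins flow through the rest of the optimizer, per advantage 2 above.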
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dilipbiswal/spark SPARK-19712-NEW

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23211.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23211

----
commit f4bb126472eb5a808a3ae94bcfb59e0674e01217
Author: Dilip Biswal
Date: 2018-12-03T22:06:24Z

    [SPARK-19712] Move PullupCorrelatedPredicates and RewritePredicateSubquery after OptimizeSubqueries

---