GitHub user gatorsmile opened a pull request:
https://github.com/apache/spark/pull/11745
[SPARK-13919] [SQL] [WIP] Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject
#### What changes were proposed in this pull request?
Currently, `ColumnPruning` and `PushPredicateThroughProject` reverse each other's
effect. Although this no longer causes the optimizer to hit its maximum number of
iterations, some queries are not optimized to the best plan.
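To illustrate the kind of interaction at issue, here is a toy, self-contained Scala sketch. The plan encoding (a list of operator labels) and both rule bodies are invented for illustration and are not Catalyst code; the point is that when two rules in a fixed-point batch undo each other, the loop converges quickly but neither rule's improvement survives in the final plan:

```scala
object RuleOscillation {
  // Hypothetical toy plan: a stack of operator labels, not real Catalyst nodes.
  type Plan = List[String]

  // Toy analogue of PushPredicateThroughProject: moves the Filter above the Project.
  def pushPredicate(p: Plan): Plan = p match {
    case "Project" :: "Filter" :: rest => "Filter" :: "Project" :: rest
    case other                         => other
  }

  // Toy analogue of ColumnPruning: moves the Project back above the Filter.
  def pruneColumns(p: Plan): Plan = p match {
    case "Filter" :: "Project" :: rest => "Project" :: "Filter" :: rest
    case other                         => other
  }

  // Fixed-point loop with an iteration cap, loosely modeled on a rule executor.
  def optimize(start: Plan, maxIter: Int = 10): (Plan, Int) = {
    var plan = start
    var i = 0
    while (i < maxIter) {
      val next = pruneColumns(pushPredicate(plan))
      if (next == plan) return (plan, i) // converged: the two rules cancelled out
      plan = next
      i += 1
    }
    (plan, i) // gave up at the cap without converging
  }

  def main(args: Array[String]): Unit = {
    val (plan, iters) = optimize(List("Project", "Filter", "Scan"))
    println(s"plan=$plan iterations=$iters")
  }
}
```

Here the composition of the two rules is the identity, so the loop stops immediately with the original plan: no max-iteration blowup, but also no lasting optimization, which is the behavior this patch targets.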
For example, in the following query,
```scala
val input = LocalRelation('a.int, 'b.string, 'c.double, 'd.int)
val originalQuery =
  input.select('a, 'b, 'c, 'd,
    WindowExpression(
      AggregateExpression(Count('b), Complete, isDistinct = false),
      WindowSpecDefinition('a :: Nil,
        SortOrder('b, Ascending) :: Nil,
        UnspecifiedFrame)).as('window))
    .where('window > 1)
    .select('a, 'c)
```
After multiple iterations of the two rules `ColumnPruning` and
`PushPredicateThroughProject`, the optimized plan we generate is:
```
Project [a#0,c#0]
+- Filter (window#0L > cast(1 as bigint))
   +- Project [a#0,c#0,window#0L]
      +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]
         +- LocalRelation [a#0,b#0,c#0,d#0]
```
However, the expected optimized plan is:
```
Project [a#0,c#0]
+- Filter (window#0L > cast(1 as bigint))
   +- Project [a#0,c#0,window#0L]
      +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]
         +- Project [a#0,b#0,c#0]
            +- LocalRelation [a#0,b#0,c#0,d#0]
```
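The only difference between the two plans is the extra `Project [a#0,b#0,c#0]` below the Window, which drops the unused column `d` before the window is computed. A minimal sketch of the underlying bookkeeping (the helper name and plain string column names are illustrative, not Catalyst's API): the columns a child must produce are the union of what the parent keeps and what the window's partition, ordering, and aggregate expressions reference; anything outside that union can be pruned.

```scala
object PruneBelowWindow {
  // Columns the child of a Window must produce: those the parent Project keeps
  // plus those referenced by the Window's own expressions.
  def requiredColumns(parentKeeps: Set[String],
                      windowRefs: Set[String]): Set[String] =
    parentKeeps ++ windowRefs

  def main(args: Array[String]): Unit = {
    val childOutput = Set("a", "b", "c", "d") // LocalRelation output
    val parentKeeps = Set("a", "c")           // final Project (window is produced by Window itself)
    val windowRefs  = Set("a", "b")           // count(b) partitioned by a, ordered by b
    val needed      = requiredColumns(parentKeeps, windowRefs)
    // Columns safe to prune below the Window; here only d is unused.
    println(childOutput -- needed)
  }
}
```

In the example query this yields {a, b, c}, so a `Project [a,b,c]` can be pushed below the Window and `d` never flows into it.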
#### How was this patch tested?
The existing test cases already expose the problem, but more regression tests are
needed to ensure that future code changes will not break it.
TODO: add more test cases.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/gatorsmile/spark predicatePushDownOverColumnPruning
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11745.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11745
----
commit c6221a4b8985ff92c425899fd48a3845ec73eb38
Author: gatorsmile <[email protected]>
Date: 2016-03-15T19:06:32Z
Merge remote-tracking branch 'upstream/master' into predicatePushDownOverColumnPruning
commit c21748aa5b3c08d25d878421f1465b9ea4e20371
Author: gatorsmile <[email protected]>
Date: 2016-03-16T00:06:03Z
address the conflicts of two rules: PushPredicateThroughProject and PushProjectThroughFilter.
----