GitHub user maryannxue opened a pull request:

    https://github.com/apache/spark/pull/21674

    [SPARK-24696][SQL] ColumnPruning rule fails to remove extra Project

    ## What changes were proposed in this pull request?
    
    The ColumnPruning rule tries to add an extra Project if an input node produces more fields than needed, but as a post-processing step it needs to remove the lower Project in the "Project - Filter - Project" pattern, otherwise it would conflict with PushPredicatesThroughProject and thus cause an infinite optimization loop. The current post-processing method is defined as:
    ```
      private def removeProjectBeforeFilter(plan: LogicalPlan): LogicalPlan = plan transform {
        case p1 @ Project(_, f @ Filter(_, p2 @ Project(_, child)))
          if p2.outputSet.subsetOf(child.outputSet) =>
          p1.copy(child = f.copy(child = child))
      }
    ```
    This method works well when there is only one Filter, but not when there are two or more. That situation arises here because one filter is deterministic and the other is non-deterministic, so they stay as separate Filter nodes and cannot be combined together.
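
    As a concrete but purely illustrative example of such a plan shape (the table `t` and column `c1` are assumed here, this is not the reproduction case from the JIRA ticket), the following query builds the F1 - F2 - P - S plan used in the illustration below; the `rand()` predicate is non-deterministic, which is what keeps the two Filters separate:
    ```
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.rand

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical table `t` with an integer column `c1` (illustrative only).
    val df = spark.table("t")
      .select($"c1")          // P: Project over the scan S
      .where($"c1" > 1)       // F2: deterministic predicate
      .where(rand() > 0.5)    // F1: non-deterministic predicate; cannot be
                              //     combined with F2 or pushed through P
    df.queryExecution.optimizedPlan  // F1 and F2 remain separate Filter nodes
    ```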
    
    A simplified illustration of the optimization process that forms the infinite loop is shown below (F1 stands for the first filter, F2 for the second filter, P for a project, S for the scan of the relation, and PredicatePushDown abbreviates PushPredicatesThroughProject):
    ```
                                 F1 - F2 - P - S
    PredicatePushDown      =>    F1 - P - F2 - S
    ColumnPruning          =>    F1 - P - F2 - P - S
                           =>    F1 - P - F2 - S        (Project removed)
    PredicatePushDown      =>    P - F1 - F2 - S
    ColumnPruning          =>    P - F1 - P - F2 - S
                           =>    P - F1 - P - F2 - P - S
                           =>    P - F1 - F2 - P - S    (only one Project removed)
    RemoveRedundantProject =>    F1 - F2 - P - S        (goes back to the loop start)
    ```
    The fix: when adding Projects, the rule still applies top-down, but the later removal of the extra Projects should proceed bottom-up, so that every extra Project can be matched. A sketch of why the traversal order matters follows.
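
    Below is a minimal, self-contained sketch of the traversal-order point: toy `Plan`/`Project`/`Filter`/`Scan` types with simplified `transformDown`/`transformUp` (in Catalyst, `transform` is the top-down variant, `transformDown`), plus a rule mirroring the shape of `removeProjectBeforeFilter` without the `outputSet` check. None of this is the real Catalyst API; it only reproduces the step where the top-down pass removes one Project while a bottom-up pass removes both:
    ```
    sealed trait Plan {
      def mapChildren(f: Plan => Plan): Plan
      // Top-down (pre-order): apply the rule at this node first, then
      // recurse into the children of the rewritten node.
      def transformDown(rule: PartialFunction[Plan, Plan]): Plan =
        rule.applyOrElse(this, identity[Plan]).mapChildren(_.transformDown(rule))
      // Bottom-up (post-order): rewrite the children first, then apply
      // the rule to the rebuilt node.
      def transformUp(rule: PartialFunction[Plan, Plan]): Plan =
        rule.applyOrElse(mapChildren(_.transformUp(rule)), identity[Plan])
    }
    case class Scan(name: String) extends Plan {
      def mapChildren(f: Plan => Plan): Plan = this
    }
    case class Project(child: Plan) extends Plan {
      def mapChildren(f: Plan => Plan): Plan = copy(child = f(child))
    }
    case class Filter(name: String, child: Plan) extends Plan {
      def mapChildren(f: Plan => Plan): Plan = copy(child = f(child))
    }

    object RemoveProjectDemo extends App {
      // Same shape as the rule's pattern: drop the Project sitting below a
      // Filter that sits directly below another Project.
      val rule: PartialFunction[Plan, Plan] = {
        case Project(Filter(name, Project(grandChild))) =>
          Project(Filter(name, grandChild))
      }
      // P - F1 - P - F2 - P - S, the problematic plan from the illustration:
      // the two Project-Filter-Project patterns share the middle Project, so a
      // top-down pass consumes it and misses the second match.
      val plan = Project(Filter("F1", Project(Filter("F2", Project(Scan("S"))))))
      println(plan.transformDown(rule)) // P - F1 - F2 - P - S: one Project left behind
      println(plan.transformUp(rule))   // P - F1 - F2 - S: both extra Projects removed
    }
    ```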
    
    ## How was this patch tested?
    
    Added an optimization rule test in ColumnPruningSuite and an end-to-end test in SQLQuerySuite.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/maryannxue/spark spark-24696

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21674.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21674
    
----
commit 11fde8ba4b64416d863a69c5587c0db67ea61d0a
Author: maryannxue <maryannxue@...>
Date:   2018-06-29T21:27:40Z

    SPARK-24696 ColumnPruning rule fails to remove extra Project

----

