[GitHub] spark pull request: [SQL] Improve column pruning in the optimizer.

rxin Thu, 10 Apr 2014 12:49:23 -0700

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/378#discussion_r11503704
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -33,7 +33,56 @@ object Optimizer extends RuleExecutor[LogicalPlan] {
         Batch("Filter Pushdown", Once,
           CombineFilters,
           PushPredicateThroughProject,
    -      PushPredicateThroughInnerJoin) :: Nil
    +      PushPredicateThroughInnerJoin,
    +      ColumnPruning) :: Nil
    +}
    +
    +/**
    + * Attempts to eliminate the reading of unneeded columns from the query plan using the following
    + * transformations:
    + *
    + *  - Inserting Projections beneath the following operators:
    + *   - Aggregate
    + *   - Project <- Join
    + *  - Collapse adjacent projections, performing alias substitution.
    + */
    +object ColumnPruning extends Rule[LogicalPlan] {
    +  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    +    case a @ Aggregate(_, _, child) if (child.outputSet -- a.references).nonEmpty =>
    +      // Project away references that are not needed to calculate the required aggregates.
    +      a.copy(child = Project(a.references.toSeq, child))
    +
    +    case Project(projectList, Join(left, right, joinType, condition)) =>
    +      // Collect the list of off references required either above or to evaluate the condition.
    +      val allReferences: Set[Attribute] =
    +        projectList.flatMap(_.references).toSet ++ condition.map(_.references).getOrElse(Set.empty)
    +      /** Applies a projection when the child is producing unnecessary attributes */
    +      def prunedChild(c: LogicalPlan) =
    +        if ((allReferences.filter(c.outputSet.contains) -- c.outputSet).nonEmpty) {
    +          Project(allReferences.filter(c.outputSet.contains).toSeq, c)
    +        } else {
    +          c
    +        }
    +
    +      Project(projectList, Join(prunedChild(left), prunedChild(right), joinType, condition))
    +
    +    case Project(project1, Project(project2, child)) =>
    --- End diff --
    
    Maybe rename these to project1list and project2list, to be more consistent with the rest of the file.
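
For readers following the `Project <- Join` case in the quoted diff, here is a minimal, self-contained sketch of the idea. It is not the Catalyst implementation: `Plan`, `Relation`, `Project`, `Join`, and `ToyColumnPruning` are hypothetical stand-ins defined here, columns are modelled as plain strings, and the join condition is reduced to the set of column names it references. The sketch inserts a projection under a join child only when `c.output -- needed` is non-empty, i.e. when the child produces columns nobody uses. The condition as quoted in the diff subtracts in the opposite direction, which appears to always yield an empty set, so that may be worth a second look.

```scala
// Toy stand-ins for logical plan nodes. These are NOT Spark/Catalyst classes;
// they exist only to illustrate the pruning idea under simplified assumptions.
sealed trait Plan { def output: Set[String] }

// A leaf relation that produces a fixed set of column names.
case class Relation(output: Set[String]) extends Plan

// Keeps only the listed columns of its child.
case class Project(columns: Set[String], child: Plan) extends Plan {
  def output: Set[String] = columns
}

// A join whose condition is represented only by the columns it references.
case class Join(left: Plan, right: Plan, conditionRefs: Set[String]) extends Plan {
  def output: Set[String] = left.output ++ right.output
}

object ToyColumnPruning {
  def prune(plan: Plan): Plan = plan match {
    case Project(columns, Join(left, right, conditionRefs)) =>
      // Everything needed either above the join or to evaluate its condition.
      val allReferences = columns ++ conditionRefs

      // Wrap a child in a projection only if it produces unneeded columns.
      def prunedChild(c: Plan): Plan = {
        val needed = allReferences intersect c.output
        if ((c.output -- needed).nonEmpty) Project(needed, c) else c
      }

      Project(columns, Join(prunedChild(left), prunedChild(right), conditionRefs))

    case other => other
  }
}
```

With a left relation producing `a, b, c`, a right relation producing `x, y`, a projection on `a`, and a condition referencing `b` and `x`, the sketch narrows the left child to `a, b` and the right child to `x`.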



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
