[GitHub] spark pull request: [SPARK-5182] [SQL] [WIP] Partitioning support ...

liancheng Wed, 29 Apr 2015 21:22:29 -0700

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5526#discussion_r29402960
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/DataSourceStrategy.scala ---
    @@ -53,6 +53,25 @@ private[sql] object DataSourceStrategy extends Strategy {
             filters,
             (a, _) => t.buildScan(a)) :: Nil
     
    +    case PhysicalOperation(projectList, filters, l @ LogicalRelation(t: FSBasedRelation)) =>
    +      val selectedPartitions = prunePartitions(filters, t.partitionSpec)
    +      val inputPaths = selectedPartitions.map(_.path).toArray
    +
    +      // Don't push down predicates that reference partition columns
    +      val pushedFilters = {
    +        val partitionColumnNames = t.partitionSpec.partitionColumns.map(_.name).toSet
    +        filters.filter { f =>
    +          val referencedColumnNames = f.references.map(_.name).toSet
    +          referencedColumnNames.intersect(partitionColumnNames).isEmpty
    +        }
    +      }
    +
    +      pruneFilterProject(
    +        l,
    +        projectList,
    +        pushedFilters,
    +        (a, f) => t.buildScan(a, f, inputPaths)) :: Nil
    --- End diff --
    
    Sorry, my bad, and thanks for pointing this out! I also realized this while
    migrating the Parquet data source. I should populate partition values here.
    `t.buildScan` should only accept data file paths within a single selected
    partition. So for a table scan operation with `N` selected partitions:
    
    1. `t.buildScan` is called `N` times, resulting in `N` RDDs
    2. Populate the partition values of each partition into its RDD
    3. Union all RDDs to form the final result
    
    This is similar to what we've done in `HiveTableScan`.
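
    The three steps above can be sketched roughly as follows. This is only an
    illustration, not the code in this PR: plain Scala collections stand in for
    RDDs so that it runs without Spark, and `Partition`, `Row`, the one-argument
    `buildScan`, and `scanSelectedPartitions` are hypothetical names made up for
    the example.

    ```scala
    // A minimal sketch, assuming Seq[Row] as a stand-in for RDD[Row].
    // All names here are hypothetical and not part of the Spark API.
    object PartitionedScanSketch {
      type Row = Map[String, Any]

      // One selected partition: its partition-column values and its data file paths.
      final case class Partition(values: Map[String, Any], paths: Seq[String])

      // Stand-in for `t.buildScan`: it sees the files of a single partition only
      // and returns rows that contain data columns but no partition columns.
      def buildScan(paths: Seq[String]): Seq[Row] =
        paths.map(p => Map[String, Any]("file" -> p))

      def scanSelectedPartitions(selected: Seq[Partition]): Seq[Row] =
        selected
          .map { part =>
            // Step 1: one scan per selected partition (N scans in total).
            val rows = buildScan(part.paths)
            // Step 2: populate this partition's values into every row of its scan.
            rows.map(_ ++ part.values)
          }
          // Step 3: union the N per-partition results into the final result.
          .foldLeft(Seq.empty[Row])(_ ++ _)

      def main(args: Array[String]): Unit = {
        val parts = Seq(
          Partition(Map("year" -> 2014), Seq("/t/year=2014/a", "/t/year=2014/b")),
          Partition(Map("year" -> 2015), Seq("/t/year=2015/c")))
        scanSelectedPartitions(parts).foreach(println)
      }
    }
    ```

    With real RDDs the last step would presumably be an RDD union rather than a
    fold over sequences, but the per-partition structure would stay the same.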
    
    I'm updating this part.



