Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5526#discussion_r29479873
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/DataSourceStrategy.scala ---
    @@ -63,6 +101,122 @@ private[sql] object DataSourceStrategy extends Strategy {
         case _ => Nil
       }
     
    +  private def buildPartitionedTableScan(
    +      logicalRelation: LogicalRelation,
    +      projections: Seq[NamedExpression],
    +      filters: Seq[Expression],
    +      partitionColumns: StructType,
    +      partitions: Array[Partition]) = {
    +    val output = projections.map(_.toAttribute)
    +    val relation = logicalRelation.relation.asInstanceOf[FSBasedRelation]
    +    val dataSchema = relation.dataSchema
    +
    +    // Builds RDD[Row]s for each selected partition.
    +    val perPartitionRows = partitions.map { case Partition(partitionValues, dir) =>
    +      // Paths to all data files within this partition
    +      val dataFilePaths = {
    +        val dirPath = new Path(dir)
    +        val fs = dirPath.getFileSystem(SparkHadoopUtil.get.conf)
    +        fs.listStatus(dirPath)
    +          .map(_.getPath)
    +          .filterNot { path =>
    +            val name = path.getName
    +            name.startsWith("_") || name.startsWith(".")
    +          }
    +          .map(fs.makeQualified(_).toString)
    +      }
    +
    +      // The table scan operator (PhysicalRDD) which retrieves required columns from data files.
    +      // Notice that the schema of data files, represented by `relation.dataSchema`, may contain
    +      // some partition column(s). Those partition columns that are only encoded in partition
    +      // directory paths are not covered by this table scan operator.
    +      val scan =
    +        pruneFilterProject(
    +          logicalRelation,
    +          projections,
    +          filters,
    +          (requiredColumns, filters) => {
    +            // Only columns that appear in the actual data, which may include some partition column(s)
    +            relation.buildScan(
    +              requiredColumns.filter(dataSchema.fieldNames.contains),
    --- End diff --
    
    Good catch! Simply changing the filter condition would be enough.
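    
    For readers skimming this thread, here is a minimal, standalone Scala sketch (not taken from the PR) of the split this line is responsible for: required columns that exist in `relation.dataSchema` are read by the underlying scan, while partition columns that are only encoded in the directory path have to be supplied from the discovered partition values instead. The names `dataFields`, `partitionFields` and `requiredColumns` below are illustrative stand-ins, not identifiers from the patch.
    
        // Hypothetical, self-contained illustration of the column split described above.
        object RequiredColumnSplit {
          def main(args: Array[String]): Unit = {
            val dataFields = Set("id", "name", "value")        // columns physically stored in data files
            val partitionFields = Set("year", "month")         // columns encoded only in directory paths
            val requiredColumns = Array("id", "value", "year") // columns requested by the query
    
            // Columns the data source scan itself must read from the files
            val fromDataFiles = requiredColumns.filter(dataFields.contains)
            // Columns that must be filled in from the partition values afterwards
            val fromPartitionValues = requiredColumns.filter(partitionFields.contains)
    
            println(s"read from data files:       ${fromDataFiles.mkString(", ")}")
            println(s"derive from partition path: ${fromPartitionValues.mkString(", ")}")
          }
        }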

