Github user chrysan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19001#discussion_r175640958
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -184,6 +189,43 @@ case class InsertIntoHadoopFsRelationCommand(
Seq.empty[Row]
}
+  private def getBucketIdExpression(dataColumns: Seq[Attribute]): Option[Expression] = {
+    bucketSpec.map { spec =>
+      val bucketColumns = spec.bucketColumnNames.map(c => dataColumns.find(_.name == c).get)
+      // Use `HashPartitioning.partitionIdExpression` as our bucket id expression, so that we can
+      // guarantee the data distribution is same between shuffle and bucketed data source, which
+      // enables us to only shuffle one side when join a bucketed table and a normal one.
+      HashPartitioning(
+        bucketColumns,
+        spec.numBuckets,
+        classOf[Murmur3Hash]
+      ).partitionIdExpression
+    }
+  }
+
+  /**
+   * How is `requiredOrdering` determined ?
--- End diff ---

Why does the definition of `requiredOrdering` here differ from the one in `InsertIntoHiveTable`?
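
For context on the quoted diff: in Spark, `HashPartitioning.partitionIdExpression` expands to a positive modulo of a Murmur3 hash over the partitioning expressions, i.e. `Pmod(Murmur3Hash(exprs), numPartitions)`. Reusing it as the bucket id expression is what makes the writer's bucket assignment line up with shuffle partitioning. A minimal standalone sketch of that semantics (not Spark source; the row hash here is a placeholder for `Murmur3Hash` over the bucket columns):

```scala
// Sketch only: mirrors Pmod(hash(bucketColumns), numBuckets) semantics.
object BucketIdSketch {
  def bucketId(bucketColumnValues: Seq[Any], numBuckets: Int): Int = {
    // Placeholder hash; Spark uses Murmur3Hash over the bucket column values.
    val hash = bucketColumnValues.hashCode()
    val mod = hash % numBuckets
    // pmod: Java's % can be negative, so shift into [0, numBuckets).
    if (mod < 0) mod + numBuckets else mod
  }

  def main(args: Array[String]): Unit = {
    val id = bucketId(Seq("user-42", 7), numBuckets = 8)
    assert(id >= 0 && id < 8)
    println(s"bucket id: $id")
  }
}
```

Because both the shuffle exchange and the bucketed write apply the same `pmod(hash, n)` mapping, a join on the bucket columns can skip shuffling the bucketed side, as the in-diff comment notes.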