Github user chrysan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19001#discussion_r175350408
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
    @@ -156,40 +144,14 @@ object FileFormatWriter extends Logging {
           statsTrackers = statsTrackers
         )
     
    -    // We should first sort by partition columns, then bucket id, and finally sorting columns.
    -    val requiredOrdering = partitionColumns ++ bucketIdExpression ++ sortColumns
    -    // the sort order doesn't matter
    -    val actualOrdering = plan.outputOrdering.map(_.child)
    -    val orderingMatched = if (requiredOrdering.length > actualOrdering.length) {
    -      false
    -    } else {
    -      requiredOrdering.zip(actualOrdering).forall {
    -        case (requiredOrder, childOutputOrder) =>
    -          requiredOrder.semanticEquals(childOutputOrder)
    -      }
    -    }
    -
         SQLExecution.checkSQLExecutionId(sparkSession)
     
         // This call shouldn't be put into the `try` block below because it only initializes and
         // prepares the job, any exception thrown from here shouldn't cause abortJob() to be called.
         committer.setupJob(job)
     
         try {
    -      val rdd = if (orderingMatched) {
    -        plan.execute()
    -      } else {
    -        // SPARK-21165: the `requiredOrdering` is based on the attributes from analyzed plan, and
    -        // the physical plan may have different attribute ids due to optimizer removing some
    -        // aliases. Here we bind the expression ahead to avoid potential attribute ids mismatch.
    -        val orderingExpr = requiredOrdering
    -          .map(SortOrder(_, Ascending))
    -          .map(BindReferences.bindReference(_, outputSpec.outputColumns))
    -        SortExec(
    --- End diff --
    
    Removing SortExec here and adding it via the EnsureRequirements rule will impact many other DataWritingCommands that depend on FileFormatWriter, such as CreateDataSourceTableAsSelectCommand. Fixing this would require changes to each such DataWritingCommand implementation so that it exposes its own requiredDistribution and requiredOrdering.
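    For illustration, a rough sketch of what exposing those requirements might look like. This is not the actual Spark API of the time; the trait members and the example override below are hypothetical, following the pattern the comment suggests (the ordering mirrors the `partitionColumns ++ bucketIdExpression ++ sortColumns` logic being removed from FileFormatWriter):

    ```scala
    // Hypothetical sketch: each DataWritingCommand declares the distribution and
    // ordering that FileFormatWriter currently computes internally, so a planner
    // rule like EnsureRequirements can insert the SortExec instead.
    trait DataWritingCommand {
      def requiredDistribution: Seq[Distribution]
      def requiredOrdering: Seq[SortOrder]
    }

    // e.g. a file-writing command would derive its ordering from its own
    // partition/bucket/sort specification:
    //   override def requiredOrdering: Seq[SortOrder] =
    //     (partitionColumns ++ bucketIdExpression ++ sortColumns)
    //       .map(SortOrder(_, Ascending))
    ```

    Each implementation (CreateDataSourceTableAsSelectCommand, InsertIntoHadoopFsRelationCommand, etc.) would then need an equivalent override, which is the code churn the comment is pointing out.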


---
