GitHub user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19828#discussion_r153274811
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2747,9 +2755,41 @@ class Dataset[T] private[sql](
        * @since 2.0.0
        */
       @scala.annotation.varargs
    -  def repartition(partitionExprs: Column*): Dataset[T] = withTypedPlan {
    -    RepartitionByExpression(
    -      partitionExprs.map(_.expr), logicalPlan, sparkSession.sessionState.conf.numShufflePartitions)
    +  def repartition(partitionExprs: Column*): Dataset[T] = {
    +    repartition(sparkSession.sessionState.conf.numShufflePartitions, partitionExprs: _*)
    +  }
    +
    +
    +  /**
    +   * Returns a new Dataset partitioned by the given partitioning expressions into
    +   * `numPartitions`. The resulting Dataset is range partitioned.
    +   *
    +   * @group typedrel
    +   * @since 2.3.0
    +   */
    +  @scala.annotation.varargs
    +  def repartitionByRange(numPartitions: Int, partitionExprs: Column*): Dataset[T] = withTypedPlan {
    +    val sortOrder: Seq[SortOrder] = partitionExprs.map { col =>
    +      col.expr match {
    +        case expr: SortOrder =>
    +          expr
    +        case expr: Expression =>
    --- End diff --
    
    What happens if we have a `SortOrder` that is not at the root node of `expr`?
    
    ```Scala
        data1d.toDF("val").repartitionByRange(data1d.size, $"val".desc + 1)
          .select(spark_partition_id().as("id"), $"val").show()
    ```
    
    ```
    org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
    Exchange rangepartitioning((val#236 DESC NULLS LAST + 1) ASC NULLS FIRST, 10)
    +- LocalTableScan [val#236]

        at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
        at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:116)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:113)
    ```
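    
    FWIW, one way the mapping could guard against this (a sketch only, not necessarily the right fix for this PR) is to reject any expression that contains a `SortOrder` below its root, since `rangepartitioning` cannot evaluate a nested `SortOrder` at execution time. The helper `toSortOrder` below is hypothetical and assumes Catalyst's `TreeNode.find` is in scope:
    
    ```Scala
    import org.apache.spark.sql.catalyst.expressions.{Ascending, Expression, SortOrder}
    
    // A SortOrder at the root is taken as-is; an expression with a SortOrder
    // buried deeper is rejected up front instead of failing at execution time;
    // anything else defaults to ascending order, as in the diff above.
    def toSortOrder(expr: Expression): SortOrder = expr match {
      case so: SortOrder => so
      case e if e.find(_.isInstanceOf[SortOrder]).isDefined =>
        throw new IllegalArgumentException(
          s"Partitioning expression '$e' contains a SortOrder below its root; " +
            "apply .asc/.desc to the outermost expression instead")
      case e => SortOrder(e, Ascending)
    }
    ```
    
    With a guard like that, the snippet above would fail fast with a clear message; the working spelling puts the `SortOrder` at the root, e.g. `($"val" + 1).desc`.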


