Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/21252#discussion_r186318646
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -127,13 +127,36 @@ case class TakeOrderedAndProjectExec(
     projectList: Seq[NamedExpression],
     child: SparkPlan) extends UnaryExecNode {

+  val sortInMemThreshold = conf.sortInMemForLimitThreshold
+
+  override def requiredChildDistribution: Seq[Distribution] = {
+    if (limit < sortInMemThreshold) {
+      super.requiredChildDistribution
+    } else {
+      Seq(AllTuples)
+    }
+  }
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] = {
+    if (limit < sortInMemThreshold) {
+      super.requiredChildOrdering
+    } else {
+      Seq(sortOrder)
+    }
+  }
--- End diff --
Previously, only the limit-ed rows from each partition were shuffled, and the
final sort was applied only to that limited data from all partitions.
Now this requires sorting and shuffling the whole dataset, so I suspect it
will degrade performance in the end.
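
For reference, the pre-change behaviour described above follows the same
per-partition top-k pattern as `RDD.takeOrdered`. A minimal sketch of that
pattern (illustrative only, not the actual `TakeOrderedAndProjectExec` code;
the helper name is made up, and it assumes a serializable `Ordering[T]`):

```scala
import org.apache.spark.rdd.RDD

// Illustrative only: per-partition top-k followed by a merge of the limited
// rows, so at most limit * numPartitions rows are moved and sorted in the
// final step instead of the whole dataset.
def takeOrderedSketch[T](rdd: RDD[T], limit: Int)(implicit ord: Ordering[T]): Seq[T] = {
  rdd
    .mapPartitions { iter =>
      // sort one partition locally and keep only its top `limit` rows
      Iterator.single(iter.toSeq.sorted(ord).take(limit))
    }
    .collect()      // gathers only the limited rows from each partition
    .toSeq
    .flatten
    .sorted(ord)    // final sort over at most limit * numPartitions rows
    .take(limit)
}
```

Requiring `AllTuples` plus a full child ordering instead means the entire
dataset is sorted and moved to a single partition before the limit is
applied, which is the regression being flagged here.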
---