[GitHub] spark pull request #16677: [SPARK-19355][SQL] Use map output statistics to i...

hvanhovell Fri, 22 Jun 2018 03:44:57 -0700

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16677#discussion_r197410511
  
    --- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
 ---
    @@ -193,6 +193,16 @@ case object SinglePartition extends Partitioning {
       }
     }
     
    +/**
    + * Represents a partitioning where rows are only serialized/deserialized 
locally. The number
    + * of partitions are not changed and also the distribution of rows. This 
is mainly used to
    + * obtain some statistics of map tasks such as number of outputs.
    + */
    +case class LocalPartitioning(orgPartition: Partitioning, numPartitions: 
Int) extends Partitioning {
    --- End diff --
    
    It will hit the sort based shuffle path as soon as the `numPartitions` > 
200 right? The problem is not that it will end up in the same shuffle file (it 
will), the (small) problem is that the sort based shuffle buffers rows and 
tries to sort them before writing them out. It is just a lot of unneeded 
complexity.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #16677: [SPARK-19355][SQL] Use map output statistics to i...

Reply via email to