[GitHub] spark pull request #16677: [SPARK-19355][SQL] Use map output statistics to i...

viirya Fri, 22 Jun 2018 02:20:05 -0700

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16677#discussion_r197388872
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
    @@ -193,6 +193,16 @@ case object SinglePartition extends Partitioning {
       }
     }
     
    +/**
    + * Represents a partitioning where rows are only serialized/deserialized locally. The number
    + * of partitions are not changed and also the distribution of rows. This is mainly used to
    + * obtain some statistics of map tasks such as number of outputs.
    + */
    +case class LocalPartitioning(orgPartition: Partitioning, numPartitions: Int) extends Partitioning {
    --- End diff --
    
    I'm not sure I understand correctly. We explicitly specify this
    `LocalPartitioning` when doing a global limit, and we submit a map stage
    using this partitioner. Why would we possibly hit a sort-based shuffle?
    
    > You basically only need to write to a single file and you're done.
    
    I think this is what we want. I specify the same number of partitions for
    `LocalPartitioning` as its child RDD has, and the rows in a partition all
    have the same partition id when using `LocalPartitioning`. Doesn't that
    make it write to a single file?
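    
    To make the argument concrete, here is a minimal sketch in plain Scala
    with no Spark dependency. `LocalPartitionIds` and `LocalPartitioningSketch`
    are hypothetical names used only for illustration; they are not the code in
    this PR and not Spark's partitioner API. The sketch assumes that every row
    produced by map task `i` is assigned partition id `i`, and checks that
    each map task then targets at most one output partition.
    
    ```scala
    // Hypothetical stand-in for the idea behind LocalPartitioning; not Spark's API.
    final case class LocalPartitionIds(numPartitions: Int) {
      // Assumption: a row keeps the index of the map task that produced it.
      def partitionIdFor(mapPartitionIndex: Int): Int = {
        require(mapPartitionIndex >= 0 && mapPartitionIndex < numPartitions)
        mapPartitionIndex
      }
    }
    
    object LocalPartitioningSketch {
      // For each map task, the distinct target partitions its rows are sent to.
      def targetsPerMapTask(child: Seq[Seq[String]]): Seq[Set[Int]] = {
        val ids = LocalPartitionIds(child.length)
        child.zipWithIndex.map { case (rows, idx) =>
          rows.map(_ => ids.partitionIdFor(idx)).toSet
        }
      }
    
      // Per-map-task row counts, the kind of statistic the map stage is meant to expose.
      def rowCounts(child: Seq[Seq[String]]): Seq[Int] = child.map(_.length)
    
      def main(args: Array[String]): Unit = {
        // Toy child RDD: four partitions, one of them empty.
        val child = Seq(Seq("a", "b"), Seq("c"), Seq.empty[String], Seq("d", "e", "f"))
        // No map task spreads its rows over more than one target partition.
        assert(targetsPerMapTask(child).forall(_.size <= 1))
        println(rowCounts(child))
      }
    }
    ```
    
    Under that assumption, the number of partitions and the distribution of
    rows are unchanged, which is what the doc comment in the diff describes.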


