[GitHub] spark pull request #16677: [SPARK-19355][SQL] Use map output statistics to i...

viirya Fri, 22 Jun 2018 05:28:13 -0700

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16677#discussion_r197430376
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
    @@ -193,6 +193,16 @@ case object SinglePartition extends Partitioning {
       }
     }
     
    +/**
    + * Represents a partitioning where rows are only serialized/deserialized locally. The number
    + * of partitions are not changed and also the distribution of rows. This is mainly used to
    + * obtain some statistics of map tasks such as number of outputs.
    + */
    +case class LocalPartitioning(orgPartition: Partitioning, numPartitions: Int) extends Partitioning {
    --- End diff --
    
    Ah, I see. Thanks for clarifying. I agree that we might need a specialized
    shuffle writing path at some point. For now, I think that when we hit the
    sort-based shuffle, this should not perform worse than the previous global
    limit operation. If you agree, I'd like to leave that for a follow-up.


