[GitHub] spark pull request #16677: [SPARK-19355][SQL] Use map output statistics to i...

hvanhovell Thu, 21 Jun 2018 05:43:23 -0700

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16677#discussion_r197117604
  
    --- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
 ---
    @@ -193,6 +193,16 @@ case object SinglePartition extends Partitioning {
       }
     }
     
    +/**
    + * Represents a partitioning where rows are only serialized/deserialized 
locally. The number
    + * of partitions are not changed and also the distribution of rows. This 
is mainly used to
    + * obtain some statistics of map tasks such as number of outputs.
    + */
    +case class LocalPartitioning(orgPartition: Partitioning, numPartitions: 
Int) extends Partitioning {
    --- End diff --
    
    This might be expensive as soon as we hit the sort based shuffle. Perhaps 
we should carve out some specialized shuffle writing path for this at some 
point. You basically only need to write to a single file and your done.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #16677: [SPARK-19355][SQL] Use map output statistics to i...

Reply via email to