Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/16677#discussion_r197388872
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
---
@@ -193,6 +193,16 @@ case object SinglePartition extends Partitioning {
}
}
+/**
+ * Represents a partitioning where rows are only serialized/deserialized locally. The number
+ * of partitions are not changed and also the distribution of rows. This is mainly used to
+ * obtain some statistics of map tasks such as number of outputs.
+ */
+case class LocalPartitioning(orgPartition: Partitioning, numPartitions: Int) extends Partitioning {
--- End diff --
I'm not sure I understand correctly. We explicitly specify this
`LocalPartitioning` when doing a global limit, and we submit a map stage using
this partitioner. Why would we possibly hit a sort-based shuffle?
> You basically only need to write to a single file and your done.
I think this is what we want. I specify the same number of partitions for
`LocalPartitioning` as its child RDD has, and with `LocalPartitioning` all rows
in a partition have the same partition id. Doesn't that make it write to a
single file?
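
To make the point about partition ids concrete, here is a minimal, Spark-free sketch of the idea. Everything in it (`SimplePartitioner`, `LocalPartitioner`, `targetsPerMapPartition`) is a hypothetical name made up for illustration and is not the code in this PR. It only assumes that each row is keyed by the id of the map partition it came from and that the partitioner returns that key unchanged.

```scala
// Hypothetical, Spark-free sketch; none of these names are Spark's API.
object LocalPartitioningSketch {

  // Minimal stand-in for a shuffle partitioner (assumption for illustration).
  trait SimplePartitioner {
    def numPartitions: Int
    def getPartition(key: Any): Int
  }

  // Rows are keyed by the id of the map partition they came from, and the
  // partitioner hands that id back unchanged.
  final class LocalPartitioner(val numPartitions: Int) extends SimplePartitioner {
    def getPartition(key: Any): Int = key match {
      case id: Int if id >= 0 && id < numPartitions => id
      case other => throw new IllegalArgumentException(s"unexpected key: $other")
    }
  }

  // For each map partition, collect the set of target partition ids that its
  // rows are assigned to.
  def targetsPerMapPartition(child: Vector[Vector[String]]): Vector[Set[Int]] = {
    // Same number of partitions as the child, as described above.
    val partitioner = new LocalPartitioner(child.length)
    child.zipWithIndex.map { case (rows, mapId) =>
      rows.map(_ => partitioner.getPartition(mapId)).toSet
    }
  }

  def main(args: Array[String]): Unit = {
    val child = Vector(Vector("a", "b"), Vector("c"), Vector("d", "e", "f"))
    targetsPerMapPartition(child).zipWithIndex.foreach { case (targets, mapId) =>
      println(s"map partition $mapId -> target partitions $targets")
    }
  }
}
```

In this sketch every map partition has exactly one target partition, which is the property I am relying on. Which shuffle writer is actually selected, and how many files it produces, is decided outside the partitioner, so the sketch does not settle that part of the question.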
---