Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/21252#discussion_r186318646
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -127,13 +127,36 @@ case class TakeOrderedAndProjectExec(
     projectList: Seq[NamedExpression],
     child: SparkPlan) extends UnaryExecNode {

+  val sortInMemThreshold = conf.sortInMemForLimitThreshold
+
+  override def requiredChildDistribution: Seq[Distribution] = {
+    if (limit < sortInMemThreshold) {
+      super.requiredChildDistribution
+    } else {
+      Seq(AllTuples)
+    }
+  }
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] = {
+    if (limit < sortInMemThreshold) {
+      super.requiredChildOrdering
+    } else {
+      Seq(sortOrder)
+    }
+  }
--- End diff --
Previously, only the limit-ed rows from each partition were shuffled, and the
final sort was applied only to that limited data from all partitions.
Now this requires sorting and shuffling the whole dataset, so I suspect it
will degrade performance in the end.
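
For reference, the pre-change behaviour described above follows the same
per-partition top-k pattern as `RDD.takeOrdered`. A minimal sketch of that
pattern (illustrative only, not the actual `TakeOrderedAndProjectExec` code;
the helper name is made up, and it assumes a serializable `Ordering[T]`):

```scala
import org.apache.spark.rdd.RDD

// Illustrative only: per-partition top-k followed by a merge of the limited
// rows, so at most limit * numPartitions rows are moved and sorted in the
// final step instead of the whole dataset.
def takeOrderedSketch[T](rdd: RDD[T], limit: Int)(implicit ord: Ordering[T]): Seq[T] = {
  rdd
    .mapPartitions { iter =>
      // sort one partition locally and keep only its top `limit` rows
      Iterator.single(iter.toSeq.sorted(ord).take(limit))
    }
    .collect()      // gathers only the limited rows from each partition
    .toSeq
    .flatten
    .sorted(ord)    // final sort over at most limit * numPartitions rows
    .take(limit)
}
```

Requiring `AllTuples` plus a full child ordering instead means the entire
dataset is sorted and moved to a single partition before the limit is
applied, which is the regression being flagged here.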
---