Github user henryr commented on a diff in the pull request:
https://github.com/apache/spark/pull/21049#discussion_r181839715
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
---
@@ -307,6 +309,32 @@ object RemoveRedundantProject extends Rule[LogicalPlan] {
   }
 }
+/**
+ * Remove [[Sort]] in subqueries that do not affect the set of rows produced, only their
+ * order. Subqueries produce unordered sets of rows so sorting their output is unnecessary.
+ */
+object RemoveSubquerySorts extends Rule[LogicalPlan] {
+
+  /**
+   * Removes all [[Sort]] operators from a plan that are accessible from the root operator via
+   * 0 or more [[Project]], [[Filter]] or [[View]] operators.
+   */
+  private def removeTopLevelSorts(plan: LogicalPlan): LogicalPlan = {
+    plan match {
+      case Sort(_, _, child) => removeTopLevelSorts(child)
+      case Project(fields, child) => Project(fields, removeTopLevelSorts(child))
+      case Filter(condition, child) => Filter(condition, removeTopLevelSorts(child))
+      case View(tbl, output, child) => View(tbl, output, removeTopLevelSorts(child))
+      case _ => plan
+    }
+  }
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case Subquery(child) => Subquery(removeTopLevelSorts(child))
+    case SubqueryAlias(name, child) => SubqueryAlias(name, removeTopLevelSorts(child))
--- End diff --
Yep, that's why I added the new rule just before `EliminateSubqueryAliases`
(which runs in the optimizer, as part of the 'finish analysis' batch). Once
`EliminateSubqueryAliases` has run, the `SubqueryAlias` nodes are gone, so
there doesn't seem to be any way to detect subqueries.
Another approach, I suppose, would be to handle this like `SparkPlan`'s
`requiredChildOrdering`: if a parent doesn't require any ordering from its
child, and the child is a `Sort` node, the child `Sort` can be dropped.
That seems like a more fundamental change, though.
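
To make that alternative concrete, here is a rough sketch on a toy plan model. None of this is Spark's API: the case classes, `DropUnneededSorts`, and the `orderingRequired` flag are hypothetical stand-ins, and the per-operator choices (e.g. that a `Limit` needs its child's order, and that a `Sort` does not need its child to be ordered) are assumptions made only for illustration.

```scala
// Toy model only: hypothetical stand-ins, not Spark's LogicalPlan classes.
sealed trait Plan
case class Relation(name: String) extends Plan
case class Sort(order: Seq[String], child: Plan) extends Plan
case class Filter(condition: String, child: Plan) extends Plan
case class Project(fields: Seq[String], child: Plan) extends Plan
case class Limit(n: Int, child: Plan) extends Plan
case class SubqueryAlias(name: String, child: Plan) extends Plan

object DropUnneededSorts {
  /**
   * Walks the plan top-down, tracking whether the parent needs the rows of
   * `plan` in a particular order. A Sort is dropped when nobody needs it.
   */
  def prune(plan: Plan, orderingRequired: Boolean): Plan = plan match {
    // Nobody above observes the order, so the Sort is redundant.
    case Sort(_, child) if !orderingRequired =>
      prune(child, orderingRequired = false)
    // The Sort is needed; it re-sorts its input, so (by assumption) the
    // child's own order does not matter.
    case Sort(order, child) =>
      Sort(order, prune(child, orderingRequired = false))
    // Filter and Project preserve row order, so they pass the parent's
    // requirement through unchanged.
    case Filter(condition, child) =>
      Filter(condition, prune(child, orderingRequired))
    case Project(fields, child) =>
      Project(fields, prune(child, orderingRequired))
    // Which rows a Limit keeps depends on the child's order.
    case Limit(n, child) =>
      Limit(n, prune(child, orderingRequired = true))
    // A subquery produces an unordered set of rows.
    case SubqueryAlias(name, child) =>
      SubqueryAlias(name, prune(child, orderingRequired = false))
    case leaf => leaf
  }
}
```

Called as `DropUnneededSorts.prune(plan, orderingRequired = true)` at the root, this would remove the `Sort` in `SubqueryAlias("t", Project(Seq("a"), Sort(Seq("a"), Relation("r"))))` but keep it in `Limit(1, Sort(Seq("a"), Relation("r")))`. A real version would need every operator to declare its ordering requirement, which is why it looks like the more fundamental change.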