wecharyu commented on code in PR #49027:
URL: https://github.com/apache/spark/pull/49027#discussion_r1896488038
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -958,6 +958,18 @@ case class Sort(
  override protected def withNewChildInternal(newChild: LogicalPlan): Sort =
    copy(child = newChild)
}
+/**
+ * Clustering data within the partition.
+ *
+ * @param cluster The clustering expressions
+ * @param child Child logical plan
+ */
+case class Clustering(cluster: Seq[SortOrder], child: LogicalPlan) extends UnaryNode {
Review Comment:
The `requiredOrdering` contains both the cluster expressions (dynamic partition
columns and the bucket id) and the sorts (sorting columns):
https://github.com/apache/spark/blob/2c1c4d2614ae1ff902c244209f7ec3c79102d3e0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/V1Writes.scala#L178-L180
I agree that ordering doesn't matter for clustering, so how about expanding
`Clustering` as follows:
```scala
case class Clustering(
    cluster: Seq[Expression],
    sorts: Seq[SortOrder],
    child: LogicalPlan) extends UnaryNode {
  override def output: Seq[Attribute] = child.output
  override def outputOrdering: Seq[SortOrder] = sorts
  override protected def withNewChildInternal(newChild: LogicalPlan): Clustering =
    copy(child = newChild)
}
```
The behavior of `Clustering` would then be: cluster the rows by `cluster`, and
within each cluster sort the data by `sorts`.
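For illustration only, here is a minimal sketch of how a write planning rule could populate the two fields of the expanded `Clustering` node above; `withClustering`, `partitionColumns`, `bucketIdExpression`, and `sortColumns` are hypothetical names for this sketch, not the actual `V1Writes` fields:
```scala
import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, Expression, SortOrder}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical helper: split a write's requirements into cluster expressions
// (dynamic partition columns plus the optional bucket id) and intra-cluster sorts.
def withClustering(
    partitionColumns: Seq[Attribute],
    bucketIdExpression: Option[Expression],
    sortColumns: Seq[Attribute],
    query: LogicalPlan): LogicalPlan = {
  // Everything that defines a cluster; the order among clusters does not matter.
  val clusterExprs: Seq[Expression] = partitionColumns ++ bucketIdExpression
  // Sorting only applies to rows within a single cluster.
  val sortOrders: Seq[SortOrder] = sortColumns.map(SortOrder(_, Ascending))
  if (clusterExprs.isEmpty && sortOrders.isEmpty) {
    query
  } else {
    Clustering(clusterExprs, sortOrders, query) // the proposed node above
  }
}
```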