wecharyu commented on code in PR #49027:
URL: https://github.com/apache/spark/pull/49027#discussion_r1896488038
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -958,6 +958,18 @@ case class Sort(
  override protected def withNewChildInternal(newChild: LogicalPlan): Sort =
    copy(child = newChild)
}
+/**
+ * Clustering data within the partition.
+ *
+ * @param cluster The clustering expressions
+ * @param child Child logical plan
+ */
+case class Clustering(cluster: Seq[SortOrder], child: LogicalPlan) extends UnaryNode {
Review Comment:
The `requiredOrdering` contains both the cluster expressions (dynamic partition
columns and the bucket id) and the sorts (sorting columns):
https://github.com/apache/spark/blob/2c1c4d2614ae1ff902c244209f7ec3c79102d3e0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/V1Writes.scala#L178-L180
I agree that ordering doesn't matter for clustering, so how about expanding
`Clustering` as follows:
```scala
case class Clustering(
    cluster: Seq[Expression],
    sorts: Seq[SortOrder],
    child: LogicalPlan) extends UnaryNode {
  override def output: Seq[Attribute] = child.output
  override def outputOrdering: Seq[SortOrder] = sorts
  override protected def withNewChildInternal(newChild: LogicalPlan): Clustering =
    copy(child = newChild)
}
```
The behavior of `Clustering` would then be: cluster the rows by `cluster`, and
within each cluster sort the data by `sorts`.
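For illustration only, here is a minimal sketch of how a write planning rule could populate the two fields of the expanded `Clustering` node above; `withClustering`, `partitionColumns`, `bucketIdExpression`, and `sortColumns` are hypothetical names for this sketch, not the actual `V1Writes` fields:
```scala
import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, Expression, SortOrder}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical helper: split a write's requirements into cluster expressions
// (dynamic partition columns plus the optional bucket id) and intra-cluster sorts.
def withClustering(
    partitionColumns: Seq[Attribute],
    bucketIdExpression: Option[Expression],
    sortColumns: Seq[Attribute],
    query: LogicalPlan): LogicalPlan = {
  // Everything that defines a cluster; the order among clusters does not matter.
  val clusterExprs: Seq[Expression] = partitionColumns ++ bucketIdExpression
  // Sorting only applies to rows within a single cluster.
  val sortOrders: Seq[SortOrder] = sortColumns.map(SortOrder(_, Ascending))
  if (clusterExprs.isEmpty && sortOrders.isEmpty) {
    query
  } else {
    Clustering(clusterExprs, sortOrders, query) // the proposed node above
  }
}
```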