minyyy commented on code in PR #56546:
URL: https://github.com/apache/spark/pull/56546#discussion_r3423131073
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -2312,10 +2312,29 @@ case class OneRowRelation() extends LeafNode {
}
}
+/**
+ * The original recipe behind a [[Deduplicate]] / [[DeduplicateWithinWatermark]] node, set by the
+ * `ResolveDeduplicate` analyzer rule and retained so a streaming query can recompute its key
+ * attributes at query start in the ordering pinned in the offset log (see
+ * `ResolveDeduplicate.computeKeys`). A `None` spec on a node means it was not built from
+ * `dropDuplicates*` (e.g. an internally/test-constructed node) and its keys must NOT be recomputed.
+ *
+ * @param subset the user-requested subset of column names (ignored when `allColumnsAsKeys`).
+ * @param allColumnsAsKeys when true, every column of the child is a deduplication key.
+ * @param viaSparkClassic whether this was built via Spark Classic (`Dataset.dropDuplicates*`, true)
+ * or Spark Connect (`transformDeduplicate`, false). Only consulted when recomputing the keys in
+ * the legacy order, where the two engines historically differed. See SPARK-57489.
+ */
+case class DeduplicateSpec(
+ subset: Seq[String],
Review Comment:
nit: Could we rename `subset` to something more descriptive, such as
`keyColumnNames`? Ideally, I would also model this as
`Either[Seq[String], Boolean]` or as a sealed trait with case classes/objects,
since either one makes the "either-or" between an explicit column list and
"all columns as keys" explicit in the type. I won't nitpick too much on the
second point, though.
```scala
sealed trait DeduplicateKeySpec
// A case object cannot take constructor parameters, so this variant is a case class.
case class DeduplicateKeyColumns(columnNames: Seq[String]) extends DeduplicateKeySpec
case object DeduplicateAllColumnsAsKey extends DeduplicateKeySpec

case class DeduplicateSpec(
  keySpec: DeduplicateKeySpec,
  viaSparkClassic: Boolean
)
```
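
To illustrate why the sealed-trait shape may be nicer to consume, here is a rough, self-contained sketch of how a caller could pattern match on the key spec. The wrapper object `DeduplicateSpecSketch` and the helper `keyColumnNames` are hypothetical names made up for this example and are not part of the PR. The column-list variant is written as a case class here, since a case object cannot carry parameters.

```scala
object DeduplicateSpecSketch {
  sealed trait DeduplicateKeySpec
  // Explicit list of key column names requested by the user.
  final case class DeduplicateKeyColumns(columnNames: Seq[String]) extends DeduplicateKeySpec
  // Every column of the child is a deduplication key.
  case object DeduplicateAllColumnsAsKey extends DeduplicateKeySpec

  final case class DeduplicateSpec(keySpec: DeduplicateKeySpec, viaSparkClassic: Boolean)

  // Hypothetical helper (an assumption for illustration only): picks the key
  // column names given the child's output column names.
  def keyColumnNames(spec: DeduplicateSpec, childColumns: Seq[String]): Seq[String] =
    spec.keySpec match {
      case DeduplicateKeyColumns(names) => names
      case DeduplicateAllColumnsAsKey   => childColumns
    }
}
```

Because the trait is sealed, the compiler can warn when a `match` misses a variant, and there is no longer a `subset` value that is silently ignored when `allColumnsAsKeys` is true.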
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]