mikhailnik-db commented on code in PR #54297:
URL: https://github.com/apache/spark/pull/54297#discussion_r2817921488
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala:
##########
@@ -564,6 +564,81 @@ case class ListAgg(
false
}
+ /**
+ * Determines whether the order mismatch between [[child]] and [[orderExpressions]] is due to
+ * a cast, and if so, whether that cast is safe for DISTINCT deduplication.
Review Comment:
> I think the general theory here is: if ordering key is col and the input
> expression is transform(col), we don't need to save order-value, if the
> transformation can preserve the equality.
>
> So a cleaner solution is to add an optimizer rule to match ListAgg, and
> replace its ordering key with the input expression, if the transformation
> preserves the equality.
It won't work out of the box: even if the transformation preserves
equality, it does not necessarily preserve ordering. E.g., `int -> string`
changes the order from numeric to lexicographic.
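
To make the ordering point concrete, here is a minimal sketch in plain Scala collections (not Spark code; the object and value names are made up for illustration). It shows the int-to-string conversion keeping values distinct while changing the sort order:

```scala
object CastOrderingSketch {
  // Illustrative input only.
  val values: Seq[Int] = Seq(2, 10, 1)

  // Sorting on the original ints gives numeric order.
  val numericOrder: Seq[Int] = values.sorted

  // Sorting on the converted strings gives lexicographic order.
  val lexicographicOrder: Seq[String] = values.map(_.toString).sorted

  // The conversion preserves equality: distinct ints stay distinct as strings.
  val equalityPreserved: Boolean =
    values.distinct.size == values.map(_.toString).distinct.size

  def main(args: Array[String]): Unit = {
    println(numericOrder.mkString(","))       // 1,2,10
    println(lexicographicOrder.mkString(",")) // 1,10,2
    println(equalityPreserved)                // true
  }
}
```

So replacing the ordering key `col` with `transform(col)` would be fine for DISTINCT deduplication in this case, but it would change the output order.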
We can do the opposite: save `col` and `transform`, and apply the
transformation on the fly during execution.
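
A rough sketch of that idea, again in plain Scala rather than the actual `ListAgg` implementation (`listAggDistinct` and its parameters are hypothetical names). It buffers only the raw `col` values, deduplicates and sorts on them, and applies `transform` last. It assumes `transform` preserves equality, so deduplicating on `col` is equivalent to deduplicating on `transform(col)`:

```scala
object DeferredTransformSketch {
  // Hypothetical helper: keeps raw `col` values and defers `transform`
  // until the final string is produced.
  def listAggDistinct[A: Ordering](
      rows: Seq[A],
      transform: A => String,
      delimiter: String): String =
    rows.distinct        // dedupe on col (valid only if transform preserves equality)
      .sorted            // order by col, i.e. the original ordering key
      .map(transform)    // apply the transformation on the fly
      .mkString(delimiter)

  def main(args: Array[String]): Unit = {
    val result = listAggDistinct(Seq(10, 2, 1, 10), (i: Int) => i.toString, ",")
    println(result) // 1,2,10
  }
}
```

With this shape only one value per row is buffered, and the output keeps the numeric order of `col`.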