mbutrovich commented on code in PR #4713:
URL: https://github.com/apache/datafusion-comet/pull/4713#discussion_r3507247183
##########
spark/src/main/scala/org/apache/comet/serde/arrays.scala:
##########
@@ -377,13 +386,31 @@ object CometArrayExcept
}
}
-object CometArrayJoin extends CometExpressionSerde[ArrayJoin] with
CodegenDispatchFallback {
+object CometArrayJoin
+ extends CometExpressionSerde[ArrayJoin]
+ with CometTypeShim
+ with CodegenDispatchFallback {
private val incompatReason = "Null handling may differ from Spark"
+ private val unsupportedCollationReason =
+ "array_join on collated strings is not supported " +
+ "(https://github.com/apache/datafusion-comet/issues/2190)"
+
override def getIncompatibleReasons(): Seq[String] = Seq(incompatReason)
- override def getSupportLevel(expr: ArrayJoin): SupportLevel =
Incompatible(Some(incompatReason))
+ override def getUnsupportedReasons(): Seq[String] =
Seq(unsupportedCollationReason)
+
+ override def getSupportLevel(expr: ArrayJoin): SupportLevel = {
+ // Spark 4.0 widens ArrayJoin's input to StringTypeWithCollation; the
native array_to_string
+ // produces UTF8_BINARY semantics and does not propagate non-default
collations, so surface
+ // that as a fallback reason in EXPLAIN, mirroring CometArrayIntersect.
+ if (hasNonDefaultStringCollation(expr.array.dataType)) {
+ Unsupported(Some(unsupportedCollationReason))
Review Comment:
This is the one design point worth discussing. `CometArrayJoin` mixes in
`CodegenDispatchFallback`, so it has the option to report `Incompatible` and
route collated inputs through the JVM codegen dispatcher (Spark's own
`doGenCode`, which stays native and matches Spark exactly). Reporting
`Unsupported` instead forces the whole projection back to Spark.
The PR mirrors `CometArrayIntersect`, which also uses `Unsupported` for
collation, so it is internally consistent with that precedent. But it diverges
from `CometReverse`, which handles the same #2190 collation issue with
`Incompatible` and keeps execution native via the dispatcher. All three mix in
`CodegenDispatchFallback`.
Is `Unsupported` intentional here because the dispatcher genuinely cannot
produce a correctly-typed collated result for `array_join`, or is it just
following `array_intersect`? If the dispatcher can run it (as it does for
`reverse`), `Incompatible` would keep collated `array_join` native and line up
the three collation cases. If it truly cannot, then `Reverse`'s
`Incompatible`-for-collation is the odd one out and may deserve a look. Either
way it would help to reconcile the three so the reasoning is uniform.
Minor, same line: the collation check reads `expr.array.dataType`, while
`CometArrayIntersect` checks the output `expr.dataType`. For `array_join` these
are equivalent since Spark defines `ArrayJoin.dataType` as the array element
type, so this is correct. Using `expr.dataType` would just mirror the sibling
more literally. Take it or leave it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]