mbutrovich commented on code in PR #4713:
URL: https://github.com/apache/datafusion-comet/pull/4713#discussion_r3507247183


##########
spark/src/main/scala/org/apache/comet/serde/arrays.scala:
##########
@@ -377,13 +386,31 @@ object CometArrayExcept
   }
 }
 
-object CometArrayJoin extends CometExpressionSerde[ArrayJoin] with 
CodegenDispatchFallback {
+object CometArrayJoin
+    extends CometExpressionSerde[ArrayJoin]
+    with CometTypeShim
+    with CodegenDispatchFallback {
 
   private val incompatReason = "Null handling may differ from Spark"
 
+  private val unsupportedCollationReason =
+    "array_join on collated strings is not supported " +
+      "(https://github.com/apache/datafusion-comet/issues/2190)"
+
   override def getIncompatibleReasons(): Seq[String] = Seq(incompatReason)
 
-  override def getSupportLevel(expr: ArrayJoin): SupportLevel = 
Incompatible(Some(incompatReason))
+  override def getUnsupportedReasons(): Seq[String] = 
Seq(unsupportedCollationReason)
+
+  override def getSupportLevel(expr: ArrayJoin): SupportLevel = {
+    // Spark 4.0 widens ArrayJoin's input to StringTypeWithCollation; the 
native array_to_string
+    // produces UTF8_BINARY semantics and does not propagate non-default 
collations, so surface
+    // that as a fallback reason in EXPLAIN, mirroring CometArrayIntersect.
+    if (hasNonDefaultStringCollation(expr.array.dataType)) {
+      Unsupported(Some(unsupportedCollationReason))

Review Comment:
   This is the one design point worth discussing. `CometArrayJoin` mixes in 
`CodegenDispatchFallback`, so it has the option to report `Incompatible` and 
route collated inputs through the JVM codegen dispatcher (Spark's own 
`doGenCode`, which stays native and matches Spark exactly). Reporting 
`Unsupported` instead forces the whole projection back to Spark.
   
   The PR mirrors `CometArrayIntersect`, which also uses `Unsupported` for 
collation, so it is internally consistent with that precedent. But it diverges 
from `CometReverse`, which handles the same #2190 collation issue with 
`Incompatible` and keeps execution native via the dispatcher. All three mix in 
`CodegenDispatchFallback`.
   
   Is `Unsupported` intentional here because the dispatcher genuinely cannot 
produce a correctly-typed collated result for `array_join`, or is it just 
following `array_intersect`? If the dispatcher can run it (as it does for 
`reverse`), `Incompatible` would keep collated `array_join` native and line up 
the three collation cases. If it truly cannot, then `Reverse`'s 
`Incompatible`-for-collation is the odd one out and may deserve a look. Either 
way it would help to reconcile the three so the reasoning is uniform.
   
   Minor, same line: the collation check reads `expr.array.dataType`, while 
`CometArrayIntersect` checks the output `expr.dataType`. For `array_join` these 
are equivalent since Spark defines `ArrayJoin.dataType` as the array element 
type, so this is correct. Using `expr.dataType` would just mirror the sibling 
more literally. Take it or leave it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to