andygrove opened a new issue, #3888:
URL: https://github.com/apache/datafusion-comet/issues/3888

   ### Describe the enhancement
   
   `array_distinct` is currently marked `Incompatible` because the DataFusion 
implementation sorts output elements, while Spark preserves insertion order of 
first occurrences.
   
   For example:
   - Spark: `array_distinct([2, 1, 2, 3])` → `[2, 1, 3]` (insertion order)
   - Comet: `array_distinct([2, 1, 2, 3])` → `[1, 2, 3]` (sorted)
   
   This difference means users must opt in with 
`spark.comet.expr.ArrayDistinct.allowIncompatible=true`.
   
   ### Proposed approach
   
   Implement a custom `array_distinct` in `native/spark-expr/` that uses a hash 
set for deduplication while preserving insertion order, matching Spark's 
behavior. This would allow upgrading the support level from `Incompatible` to 
`Compatible`.
   
   Key behaviors to match:
   - Preserve insertion order of first occurrence
   - De-duplicate NULL elements (keep only the first NULL)
   - De-duplicate NaN values (Float and Double)
   - Support all orderable element types
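
The behaviors above can be sketched in plain Rust. This is an illustrative helper over a nullable `f64` column, not the actual Comet implementation: NULLs are modeled as `Option`, NaN bit patterns are canonicalized so all NaNs compare equal, and a `HashSet` keeps only the first occurrence while preserving insertion order (signed-zero normalization is left aside here).

```rust
use std::collections::HashSet;

/// Illustrative sketch: insertion-order dedup for a nullable Double
/// column, matching Spark's array_distinct ordering semantics.
/// (Hypothetical helper name; not the Comet API.)
fn spark_array_distinct(input: &[Option<f64>]) -> Vec<Option<f64>> {
    // Key space: None for NULL, Some(bits) for values. All NaN bit
    // patterns map to one canonical key so every NaN compares equal.
    let mut seen: HashSet<Option<u64>> = HashSet::new();
    let mut out = Vec::new();
    for &elem in input {
        let key = elem.map(|v| {
            if v.is_nan() { f64::NAN.to_bits() } else { v.to_bits() }
        });
        // HashSet::insert returns true only the first time a key is
        // seen, so first occurrences are kept in input order.
        if seen.insert(key) {
            out.push(elem);
        }
    }
    out
}
```

A real implementation would operate on Arrow arrays and generalize the key encoding across element types, but the one-pass set-membership shape is the same.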
   
   ### Additional context
   
   The current implementation delegates to DataFusion's built-in 
`array_distinct` UDF (`datafusion-functions-nested`, `set_ops.rs`), which uses 
`RowConverter` with `.sorted().dedup()`. A Spark-compatible version cannot 
simply drop the sort, since `.dedup()` only removes adjacent duplicates; it 
needs a different deduplication strategy, such as hash-set membership tracking 
over the row-encoded values.
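
The two strategies can be contrasted over opaque row-encoded keys (the kind of byte strings a `RowConverter` produces); the function names below are illustrative, not from the DataFusion codebase:

```rust
use std::collections::HashSet;

/// DataFusion-style approach: sort, then dedup adjacent rows.
/// Deduplication is correct, but output order is the sort order.
fn distinct_sorted(mut rows: Vec<Vec<u8>>) -> Vec<Vec<u8>> {
    rows.sort();
    rows.dedup(); // only removes adjacent duplicates, hence the sort
    rows
}

/// Spark-compatible approach: one pass with a hash-set membership
/// test, keeping each row the first time it is seen.
fn distinct_insertion_order(rows: Vec<Vec<u8>>) -> Vec<Vec<u8>> {
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    rows.into_iter().filter(|r| seen.insert(r.clone())).collect()
}
```

For the `[2, 1, 2, 3]` example from above, the first function yields the sorted `[1, 2, 3]` while the second yields Spark's `[2, 1, 3]`.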
   
   Test coverage was expanded in #3887.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to