andygrove opened a new issue, #3174:
URL: https://github.com/apache/datafusion-comet/issues/3174

   ## Summary
   
   `array_distinct` is marked as `Incompatible` in Comet, but the specific incompatibility is not documented. This issue tracks documenting and potentially fixing the behavior difference.
   
   ## Spark Specification
   
   Spark's `array_distinct`:
   - Removes duplicate elements from an array **while preserving the order of first occurrence**
   - Handles null values by keeping only the first null encountered
   - Uses `SQLOpenHashSet` for duplicate detection, at expected O(1) cost per element
   
   Example:
   ```sql
   SELECT array_distinct(array(3, 1, 2, 1, 3));
   -- Spark returns: [3, 1, 2] (preserves insertion order)
   ```
   
   ## Current Comet Behavior
   
   The test file `CometArrayExpressionSuite.scala` includes comments explaining the incompatibility:
   ```scala
   // The result needs to be in ascending order for checkSparkAnswerAndOperator to pass
   // because datafusion array_distinct sorts the elements and then removes the duplicates
   ```
   
   And:
   ```scala
   // NULL needs to be the first element for checkSparkAnswerAndOperator to pass because
   // datafusion array_distinct sorts the elements and then removes the duplicates
   ```
   
   DataFusion's `array_distinct`:
   - **Sorts** the elements before removing duplicates
   - This changes the output order compared to Spark
   
   Example:
   ```sql
   SELECT array_distinct(array(3, 1, 2, 1, 3));
   -- DataFusion returns: [1, 2, 3] (sorted order, not insertion order)
   ```
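
To make the difference concrete, here is a minimal Rust model of the sort-then-deduplicate behavior that the test comments describe. It is only an illustration under stated assumptions, not DataFusion's actual code: the function name `sort_then_dedup` is hypothetical, elements are limited to `i32`, and `None` stands in for SQL NULL.

```rust
/// Hypothetical model of "sort, then remove duplicates".
/// `None` stands in for SQL NULL; this is not DataFusion's real implementation.
fn sort_then_dedup(input: &[Option<i32>]) -> Vec<Option<i32>> {
    let mut out = input.to_vec();
    // Option's ordering places None before every Some(_), so NULL sorts first.
    out.sort();
    // After sorting, equal elements are adjacent, so dedup removes all repeats.
    out.dedup();
    out
}
```

In this model, the input `[3, NULL, 1, 2, 1, NULL, 3]` produces `[NULL, 1, 2, 3]`, which is consistent with both test comments: the result is in ascending order and NULL comes first. The original insertion order is lost.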
   
   ## Impact
   
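   Queries that depend on the order of elements in `array_distinct` output return different results under Comet than under Spark, for example when indexing into the result or comparing it against an expected array.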
   
   ## Possible Solutions
   
   1. **Custom Rust implementation** that uses a HashSet but preserves insertion order (like Spark)
   2. **Document clearly** in compatibility matrix that element order differs
   3. Keep as Incompatible and require `allow_incompatible=true`
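
As a rough illustration of option 1, the following sketch removes duplicates with a `HashSet` while keeping the order of first occurrence and only the first NULL. It is a sketch under stated assumptions rather than a proposed patch: the function name `array_distinct_preserving_order` is hypothetical, elements are limited to `i32`, and `None` stands in for SQL NULL.

```rust
use std::collections::HashSet;

/// Hypothetical order-preserving distinct over a single array.
/// `None` stands in for SQL NULL.
fn array_distinct_preserving_order(input: &[Option<i32>]) -> Vec<Option<i32>> {
    let mut seen: HashSet<i32> = HashSet::with_capacity(input.len());
    let mut seen_null = false;
    let mut out = Vec::with_capacity(input.len());

    for item in input {
        match item {
            // Keep only the first NULL, at the position where it first appears.
            None => {
                if !seen_null {
                    seen_null = true;
                    out.push(None);
                }
            }
            // `insert` returns true only the first time a value is added,
            // so each value is emitted once, in order of first occurrence.
            Some(v) => {
                if seen.insert(*v) {
                    out.push(Some(*v));
                }
            }
        }
    }
    out
}
```

For the input `[3, 1, 2, 1, 3]` this returns `[3, 1, 2]`, matching the Spark example above. A real implementation would need to operate on Arrow list arrays and cover the other supported element types, which this sketch does not attempt.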
   
   ---
   
   > **Note:** This issue was generated with AI assistance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

