andygrove opened a new issue, #3174: URL: https://github.com/apache/datafusion-comet/issues/3174
## Summary

`array_distinct` is marked as `Incompatible` in Comet, but the specific incompatibility is not documented. This issue tracks documenting and potentially fixing the behavior difference.

## Spark Specification

According to Spark's `array_distinct` behavior:

- Removes duplicate elements from an array **while preserving the order of first occurrence**
- Handles null values by keeping only the first null encountered
- Uses `SQLOpenHashSet` for O(1) duplicate detection

Example:

```sql
SELECT array_distinct(array(3, 1, 2, 1, 3));
-- Spark returns: [3, 1, 2] (preserves insertion order)
```

## Current Comet Behavior

The test file `CometArrayExpressionSuite.scala` includes comments explaining the incompatibility:

```scala
// The result needs to be in ascending order for checkSparkAnswerAndOperator to pass
// because datafusion array_distinct sorts the elements and then removes the duplicates
```

And:

```scala
// NULL needs to be the first element for checkSparkAnswerAndOperator to pass because
// datafusion array_distinct sorts the elements and then removes the duplicates
```

DataFusion's `array_distinct`:

- **Sorts** the elements before removing duplicates
- This changes the output order compared to Spark

Example:

```sql
SELECT array_distinct(array(3, 1, 2, 1, 3));
-- DataFusion returns: [1, 2, 3] (sorted order, not insertion order)
```

## Impact

Users who depend on the order of elements in `array_distinct` output will see different results.

## Possible Solutions

1. **Custom Rust implementation** that uses a HashSet but preserves insertion order (like Spark)
2. **Document clearly** in compatibility matrix that element order differs
3. Keep as Incompatible and require `allow_incompatible=true`

---

> **Note:** This issue was generated with AI assistance.
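As a rough illustration of the first option (a HashSet-based implementation that preserves insertion order), here is a minimal sketch. It is not Comet's or DataFusion's code: the function name `array_distinct_preserving_order` is hypothetical, and it works on a plain slice of `Option<T>` rather than on Arrow arrays, with `None` standing in for SQL NULL.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical helper, not part of Comet or DataFusion.
/// Returns the distinct elements of `input` in order of first occurrence,
/// keeping only the first `None` (standing in for SQL NULL).
fn array_distinct_preserving_order<T: Eq + Hash + Clone>(
    input: &[Option<T>],
) -> Vec<Option<T>> {
    // Non-null values seen so far; membership checks are O(1) on average.
    let mut seen: HashSet<&T> = HashSet::new();
    // Nulls are tracked separately with a flag.
    let mut seen_null = false;
    let mut out = Vec::with_capacity(input.len());

    for item in input {
        match item {
            None => {
                if !seen_null {
                    seen_null = true;
                    out.push(None);
                }
            }
            Some(value) => {
                // `insert` returns true only the first time a value is added.
                if seen.insert(value) {
                    out.push(Some(value.clone()));
                }
            }
        }
    }
    out
}
```

Calling it on `[Some(3), Some(1), Some(2), Some(1), Some(3)]` yields `[Some(3), Some(1), Some(2)]`, matching the Spark result quoted above. The sketch assumes `T: Eq + Hash`; floating-point element types would need separate handling, since `f64` implements neither trait.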
