andygrove opened a new pull request, #3674:
URL: https://github.com/apache/datafusion-comet/pull/3674
## Which issue does this PR close?
Closes #3645.
## Rationale for this change
DataFusion's `array_has_any` function treats `NULL` as equal to `NULL`, so
`arrays_overlap(array(1, NULL), array(NULL, 2))` incorrectly returns `true`
instead of `null`. This violates Spark's three-valued null semantics for
`arrays_overlap`.
## What changes are included in this PR?
- Add a custom Spark-compatible `arrays_overlap` implementation in Rust
  (`native/spark-expr/src/array_funcs/arrays_overlap.rs`) that intercepts the
  `array_has_any` function name with correct null handling:
  - `true` when arrays share a common non-null element
  - `null` when no common non-null elements exist but either array contains
    nulls
  - `false` when no common elements and no nulls
- Type-specialized fast paths for primitive types (`HashSet<T::Native>`) and
  strings (`HashSet<&str>`), with a `ScalarValue` fallback for complex types
- Dispatch and downcast hoisted to batch level; HashSet reused across rows
  for efficiency
- Supports both `ListArray` and `LargeListArray` via `GenericListArray<O>`
- Remove `ignore` annotations from previously-failing SQL tests
- Add compatibility documentation in the user guide
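
For readers who want the null-handling rules in concrete form, here is a minimal sketch that encodes only the three rules listed above and reuses one `HashSet` across rows. It is not the code in this PR: the function names `overlap_row` and `overlap_rows` are hypothetical, rows are modeled as plain `Vec<Option<i32>>` rather than Arrow arrays, and edge cases such as empty arrays or a null array argument are not modeled.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not the PR's implementation): evaluates one row.
/// `seen` is a scratch set supplied by the caller so it can be reused.
fn overlap_row(
    left: &[Option<i32>],
    right: &[Option<i32>],
    seen: &mut HashSet<i32>,
) -> Option<bool> {
    // Reset the scratch set, then load the non-null elements of `left`.
    seen.clear();
    seen.extend(left.iter().flatten().copied());

    // Rule 1: `true` when a non-null element of `right` also occurs in `left`.
    if right.iter().flatten().any(|v| seen.contains(v)) {
        return Some(true);
    }

    // Rule 2: no shared non-null element, but a null on either side
    // makes the answer unknown, so return null (`None`).
    let has_null = left.iter().chain(right.iter()).any(|v| v.is_none());
    if has_null {
        return None;
    }

    // Rule 3: no shared element and no nulls.
    Some(false)
}

/// Hypothetical batch driver: allocates the set once and reuses it per row.
fn overlap_rows(rows: &[(Vec<Option<i32>>, Vec<Option<i32>>)]) -> Vec<Option<bool>> {
    let mut seen: HashSet<i32> = HashSet::new();
    rows.iter()
        .map(|(left, right)| overlap_row(left, right, &mut seen))
        .collect()
}
```

Under these assumptions, the case from the rationale, `[Some(1), None]` against `[None, Some(2)]`, evaluates to `None`, which corresponds to SQL `null`.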
## How are these changes tested?
- 6 Rust unit tests covering: overlap, no overlap, null-only overlap
returning null, null array, empty array, and a multi-row scenario matching the
issue reproduction
- Re-enabled 2 SQL-based integration tests that were previously ignored due
to this bug