yaooqinn opened a new pull request, #54034:
URL: https://github.com/apache/spark/pull/54034
### What changes were proposed in this pull request?
This PR adds support for the IGNORE NULLS and RESPECT NULLS clauses to the
`array_agg` and `collect_list` aggregate functions.
### Why are the changes needed?
The SQL standard and many databases (PostgreSQL, Snowflake, DuckDB, etc.)
support the IGNORE NULLS / RESPECT NULLS syntax for aggregate functions.
Currently, Spark only supports this syntax for window functions like `first`,
`last`, `lead`, `lag`, and `nth_value`.
By adding this support to `array_agg` and `collect_list`, users can
explicitly control whether null values should be included in the resulting
array:
- `array_agg(col) IGNORE NULLS` - skips null values (default behavior)
- `array_agg(col) RESPECT NULLS` - includes null values in the result
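
The two behaviors can be modelled in plain Python, independent of Spark. This is only an illustrative sketch: `collect_list_model` and its `ignore_nulls` parameter are hypothetical names, not part of this PR, and Python's `None` stands in for SQL NULL.

```python
from typing import Any, Iterable, List


def collect_list_model(values: Iterable[Any], ignore_nulls: bool = True) -> List[Any]:
    """Hypothetical model: collect values into a list, optionally dropping None."""
    if ignore_nulls:
        # IGNORE NULLS: skip every None, keep the remaining values in order.
        return [v for v in values if v is not None]
    # RESPECT NULLS: keep every value, including None, in order.
    return list(values)


rows = [1, None, 2, None, 3]
print(collect_list_model(rows))                      # models IGNORE NULLS (the default)
print(collect_list_model(rows, ignore_nulls=False))  # models RESPECT NULLS
```

In this model the default matches the behavior described above: nulls are dropped unless the caller opts in to keeping them.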
### Implementation Details
1. Added an `ignoreNulls: Boolean = true` parameter to the `CollectList` class
2. `array_agg` now uses `CollectList`, because the two have identical behavior
3. Changed `UnresolvedFunction.ignoreNulls` from `Boolean` to
   `Option[Boolean]` to distinguish between:
   - `None`: no clause specified (use function's default)
   - `Some(true)`: IGNORE NULLS explicitly specified
   - `Some(false)`: RESPECT NULLS explicitly specified
4. Consolidated ignoreNulls resolution logic in `FunctionResolution` with
   shared `resolveIgnoreNulls` and `applyIgnoreNulls` methods
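
The three-state resolution in item 3 can be sketched as follows. This is a Python model of the idea, not the Scala code in the PR: `resolve_ignore_nulls_model` and its signature are assumptions made for illustration, with `Optional[bool]` standing in for Scala's `Option[Boolean]` (`None` for `None`, `True` for `Some(true)`, `False` for `Some(false)`).

```python
from typing import Optional


def resolve_ignore_nulls_model(clause: Optional[bool], function_default: bool) -> bool:
    """Hypothetical model: an explicit clause wins, otherwise use the function's default."""
    if clause is None:
        # No IGNORE NULLS / RESPECT NULLS clause was written in the query.
        return function_default
    # True models IGNORE NULLS, False models RESPECT NULLS.
    return clause


# A function whose default is to ignore nulls, as described for CollectList above.
print(resolve_ignore_nulls_model(None, function_default=True))
print(resolve_ignore_nulls_model(False, function_default=True))
# A function whose default is to respect nulls keeps that default when no clause is given.
print(resolve_ignore_nulls_model(None, function_default=False))
```

A plain boolean cannot express "no clause specified", which is why the third state is needed: without it, functions with different defaults could not tell an omitted clause from an explicit one.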
### Does this PR introduce _any_ user-facing change?
Yes. Users can now use IGNORE NULLS / RESPECT NULLS with `array_agg` and
`collect_list`:
```sql
SELECT array_agg(col IGNORE NULLS) FROM table;
SELECT collect_list(col RESPECT NULLS) OVER (PARTITION BY id) FROM table;
```
### How was this patch tested?
Added unit tests in `DataFrameAggregateSuite`:
- `array_agg and collect_list skip nulls by default`
- `array_agg with IGNORE NULLS explicitly skips nulls`
- `array_agg with RESPECT NULLS preserves nulls`
### Was this patch authored or co-authored using generative AI tooling?
Yes, GitHub Copilot was used to assist with this implementation.