Kent Yao created SPARK-55256:
--------------------------------
Summary: [SQL] Support IGNORE NULLS / RESPECT NULLS for array_agg
and collect_list
Key: SPARK-55256
URL: https://issues.apache.org/jira/browse/SPARK-55256
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.2.0
Reporter: Kent Yao
This PR adds support for the IGNORE NULLS and RESPECT NULLS clauses for
the array_agg and collect_list aggregate functions.

The SQL standard and many databases (PostgreSQL, Snowflake, DuckDB, etc.)
support the IGNORE NULLS / RESPECT NULLS syntax for aggregate functions.
Currently, Spark only supports this syntax for window functions like first,
last, lead, lag, and nth_value.

By adding this support to array_agg and collect_list, users can explicitly
control whether null values should be included in the resulting array:
- array_agg(col) IGNORE NULLS - skips null values (default behavior)
- array_agg(col) RESPECT NULLS - includes null values in the result
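
The two behaviors listed above can be modeled in a few lines of plain Python. This is only a sketch of the described semantics, not Spark code: the function name and the ignore_nulls parameter are illustrative stand-ins, None stands in for SQL NULL, and input order is preserved for simplicity even though Spark does not guarantee the order of collected elements.

```python
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def collect_list(values: Iterable[Optional[T]],
                 ignore_nulls: bool = True) -> List[Optional[T]]:
    """Plain-Python model of the proposed semantics (not Spark's implementation)."""
    if ignore_nulls:
        # IGNORE NULLS: drop every null before building the array.
        return [v for v in values if v is not None]
    # RESPECT NULLS: keep nulls as array elements.
    return list(values)


column = [1, None, 2, None, 3]
ignored = collect_list(column)                         # models IGNORE NULLS (the default)
respected = collect_list(column, ignore_nulls=False)   # models RESPECT NULLS
```

With no flag given, the model falls back to skipping nulls, matching the default behavior stated above.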

Implementation Details:
1. Added an ignoreNulls: Boolean = true parameter to the CollectList class.
2. array_agg now uses CollectList, since the two have identical behavior.
3. Changed UnresolvedFunction.ignoreNulls from Boolean to Option[Boolean] to
   distinguish between None (use the function default), Some(true)
   (IGNORE NULLS), and Some(false) (RESPECT NULLS).
4. Consolidated the ignoreNulls resolution logic in FunctionResolution with
   shared resolveIgnoreNulls and applyIgnoreNulls methods.
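
The reason for moving from Boolean to Option[Boolean] is that a plain false cannot tell "RESPECT NULLS was written" apart from "no clause was written", which matters once a function's default is to ignore nulls. The sketch below models that three-state resolution in plain Python. It is an assumption-laden illustration: the function name, its signature, and the defaults table are hypothetical and are not the actual resolveIgnoreNulls or applyIgnoreNulls methods in FunctionResolution.

```python
from typing import Optional

# Hypothetical per-function defaults, based only on the description above:
# array_agg and collect_list skip nulls when no clause is given.
FUNCTION_DEFAULTS = {"array_agg": True, "collect_list": True}


def resolve_ignore_nulls(function_name: str, clause: Optional[bool]) -> bool:
    """Map the parsed clause to an effective ignoreNulls flag.

    None  -> no clause written, use the function's default
    True  -> IGNORE NULLS was written
    False -> RESPECT NULLS was written
    """
    if clause is None:
        return FUNCTION_DEFAULTS[function_name]
    return clause


no_clause = resolve_ignore_nulls("collect_list", None)   # falls back to the default
ignore = resolve_ignore_nulls("collect_list", True)      # explicit IGNORE NULLS
respect = resolve_ignore_nulls("collect_list", False)    # explicit RESPECT NULLS
```

Keeping the "no clause" case as a distinct value lets each function choose its own default without the parser needing to know it.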

Users can now use IGNORE NULLS / RESPECT NULLS with array_agg and collect_list:
  SELECT array_agg(col IGNORE NULLS) FROM table;
  SELECT collect_list(col RESPECT NULLS) OVER (PARTITION BY id) FROM table;