[PR] [SPARK-55533][SQL] Support IGNORE NULLS / RESPECT NULLS for collect_set [spark]

via GitHub Sat, 14 Feb 2026 21:15:52 -0800


yaooqinn opened a new pull request, #54329:
URL: https://github.com/apache/spark/pull/54329


   ### What changes were proposed in this pull request?
   
   This PR adds `IGNORE NULLS` / `RESPECT NULLS` support to `collect_set`, mirroring the existing `collect_list` behavior added in SPARK-55256.
   
   - `collect_set(expr)` — default, skips nulls (unchanged behavior)
   - `collect_set(expr) IGNORE NULLS` — explicitly skips nulls
   - `collect_set(expr) RESPECT NULLS` — includes null in the result set
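
To make the three modes above concrete, here is a minimal sketch in plain Python that models the described semantics. It is a toy illustration only, not Spark's implementation: the function name `collect_set` is reused for readability, and the `respect_nulls` parameter is a hypothetical stand-in for the SQL clause, with `None` playing the role of SQL null.

```python
from typing import Hashable, Iterable, Optional, Set


def collect_set(
    values: Iterable[Optional[Hashable]],
    respect_nulls: bool = False,  # hypothetical flag standing in for RESPECT NULLS
) -> Set[Optional[Hashable]]:
    """Toy model of the null handling described in this PR (not Spark code)."""
    result: Set[Optional[Hashable]] = set()
    for value in values:
        if value is None and not respect_nulls:
            # Default and IGNORE NULLS: nulls are dropped.
            continue
        # RESPECT NULLS: a null is kept and deduplicated like any other value.
        result.add(value)
    return result


rows = [1, None, 2, 1, None]
default_result = collect_set(rows)                      # default / IGNORE NULLS
respect_result = collect_set(rows, respect_nulls=True)  # RESPECT NULLS
```

In this model the default and `IGNORE NULLS` cases are the same code path, which matches the PR's statement that existing behavior is unchanged. With `RESPECT NULLS`, repeated nulls collapse to a single null entry because the result is a set.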
   
   ### Why are the changes needed?
   
   For consistency: `collect_list`/`array_agg` already support this syntax, but `collect_set` does not. Users who want to include null in collected sets currently have no way to do so.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. `collect_set` now accepts `IGNORE NULLS` and `RESPECT NULLS` clauses in SQL.
   
   ### How was this patch tested?
   
   Added 3 new tests in `DataFrameAggregateSuite`:
   - `collect_set skips nulls by default`
   - `collect_set with IGNORE NULLS explicitly skips nulls`
   - `collect_set with RESPECT NULLS preserves null in set`
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes, co-authored with GitHub Copilot.

