[
https://issues.apache.org/jira/browse/SPARK-48362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18047113#comment-18047113
]
Holden Karau commented on SPARK-48362:
--------------------------------------
In my mind we'd do this as a two-parter:
1) Add a limit parameter to collect_set & collect_list (or add limited
versions of them, like collect_set_with_limit).
2) Follow-up issue/PR: add an optimizer rule so that if folks put a slice on
top of a collect_set / collect_list we automatically rewrite it to the
limited form (see the sketch below).
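For reference, here is a minimal, self-contained sketch of the
slice-over-collect_set pattern that the rewrite in 2) would target; the table
and column names are made up, and only existing public APIs are used:
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, slice}

object SliceOverCollectSet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("slice-over-collect_set")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4)).toDF("k", "v")

    // Today the full distinct set is materialized per group and only trimmed
    // afterwards by slice; the proposed rule would rewrite this so the
    // aggregation buffer never grows past the slice length.
    df.groupBy($"k")
      .agg(slice(collect_set($"v"), 1, 2).as("vs"))
      .show()

    spark.stop()
  }
}
{code}
With the rewrite in place, a query shaped like the one above would behave as
if the user had called the limited variant from 1) directly.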
> Add CollectSetWithLimit
> -----------------------
>
> Key: SPARK-48362
> URL: https://issues.apache.org/jira/browse/SPARK-48362
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Holden Karau
> Priority: Major
>
> See
> [https://stackoverflow.com/questions/38730912/how-to-limit-functions-collect-set-in-spark-sql]
>
> Some users want to collect a set, but if the number of distinct elements is
> too large they can hit a "Cannot grow BufferHolder" error when they collect
> the full set and then trim it.
>
> We should offer a collect_set variant that pre-emptively stops adding
> elements once the requested limit is reached, reducing the amount of memory
> used.
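A user-level sketch of the pre-emptive behaviour described above is possible
today with the public Aggregator API; the class name and the cap of 5 below
are made up, and the result is returned as a joined string only to stay on
stock Encoders (the real built-in would return an array column like
collect_set does):
{code:scala}
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// Hypothetical stand-in for the proposed built-in: stop adding elements once
// `limit` distinct values have been kept, instead of collecting everything
// and trimming afterwards.
class LimitedCollectSet(limit: Int) extends Aggregator[Long, Set[Long], String] {
  def zero: Set[Long] = Set.empty
  def reduce(buf: Set[Long], v: Long): Set[Long] =
    if (buf.size >= limit) buf else buf + v        // buffer never grows past `limit`
  def merge(b1: Set[Long], b2: Set[Long]): Set[Long] =
    (b1 ++ b2).take(limit)                         // re-cap after merging partial buffers
  def finish(buf: Set[Long]): String = buf.mkString(",")
  def bufferEncoder: Encoder[Set[Long]] = Encoders.kryo[Set[Long]]
  def outputEncoder: Encoder[String] = Encoders.STRING
}

object LimitedCollectSetDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("collect_set_with_limit-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val limitedSet = udaf(new LimitedCollectSet(5))

    // Each group keeps at most 5 distinct values instead of all of them.
    spark.range(0, 1000)
      .withColumn("k", $"id" % 3)
      .groupBy($"k")
      .agg(limitedSet($"id").as("sample"))
      .show(truncate = false)

    spark.stop()
  }
}
{code}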