[
https://issues.apache.org/jira/browse/SPARK-48362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18047113#comment-18047113
]
Holden Karau commented on SPARK-48362:
--------------------------------------
In my mind we'd do this as a two-parter:
1) Add a limit parameter to collect_set & collect_list (or add limited
versions of them, like collect_set_with_limit).
2) Follow-up issue/PR: add an optimizer rule so that if folks put a slice on
top of a collect_set / collect_list we automatically rewrite it to the
limited form (see the sketch below).
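For reference, here is a minimal, self-contained sketch of the
slice-over-collect_set pattern that the rewrite in 2) would target; the table
and column names are made up, and only existing public APIs are used:
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, slice}

object SliceOverCollectSet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("slice-over-collect_set")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4)).toDF("k", "v")

    // Today the full distinct set is materialized per group and only trimmed
    // afterwards by slice; the proposed rule would rewrite this so the
    // aggregation buffer never grows past the slice length.
    df.groupBy($"k")
      .agg(slice(collect_set($"v"), 1, 2).as("vs"))
      .show()

    spark.stop()
  }
}
{code}
With the rewrite in place, a query shaped like the one above would behave as
if the user had called the limited variant from 1) directly.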
> Add CollectSetWithLimit
> -----------------------
>
> Key: SPARK-48362
> URL: https://issues.apache.org/jira/browse/SPARK-48362
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Holden Karau
> Priority: Major
>
> See
> [https://stackoverflow.com/questions/38730912/how-to-limit-functions-collect-set-in-spark-sql]
>
> Some users want to collect a set, but if the number of distinct elements is
> too large they can hit a "Cannot grow BufferHolder" error when they collect
> the full set and then trim it.
>
> We should offer a collect_set variant that pre-emptively stops adding
> elements once the requested limit is reached, reducing the amount of memory
> used.
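A user-level sketch of the pre-emptive behaviour described above is possible
today with the public Aggregator API; the class name and the cap of 5 below
are made up, and the result is returned as a joined string only to stay on
stock Encoders (the real built-in would return an array column like
collect_set does):
{code:scala}
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// Hypothetical stand-in for the proposed built-in: stop adding elements once
// `limit` distinct values have been kept, instead of collecting everything
// and trimming afterwards.
class LimitedCollectSet(limit: Int) extends Aggregator[Long, Set[Long], String] {
  def zero: Set[Long] = Set.empty
  def reduce(buf: Set[Long], v: Long): Set[Long] =
    if (buf.size >= limit) buf else buf + v        // buffer never grows past `limit`
  def merge(b1: Set[Long], b2: Set[Long]): Set[Long] =
    (b1 ++ b2).take(limit)                         // re-cap after merging partial buffers
  def finish(buf: Set[Long]): String = buf.mkString(",")
  def bufferEncoder: Encoder[Set[Long]] = Encoders.kryo[Set[Long]]
  def outputEncoder: Encoder[String] = Encoders.STRING
}

object LimitedCollectSetDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("collect_set_with_limit-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val limitedSet = udaf(new LimitedCollectSet(5))

    // Each group keeps at most 5 distinct values instead of all of them.
    spark.range(0, 1000)
      .withColumn("k", $"id" % 3)
      .groupBy($"k")
      .agg(limitedSet($"id").as("sample"))
      .show(truncate = false)

    spark.stop()
  }
}
{code}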