[
https://issues.apache.org/jira/browse/SPARK-4273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200172#comment-14200172
]
Apache Spark commented on SPARK-4273:
-------------------------------------
User 'YanTangZhai' has created a pull request for this issue:
https://github.com/apache/spark/pull/3137
> Providing ExternalSet to avoid OOM when count(distinct)
> -------------------------------------------------------
>
> Key: SPARK-4273
> URL: https://issues.apache.org/jira/browse/SPARK-4273
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, SQL
> Reporter: YanTang Zhai
> Priority: Minor
>
> Some tasks may OOM when computing count(distinct) if they need to process
> many records. CombineSetsAndCountFunction puts all records into an
> OpenHashSet; if it fetches many records, it may occupy a large amount of
> memory.
> I think a data structure ExternalSet, analogous to ExternalAppendOnlyMap,
> could be provided to spill OpenHashSet data to disk when its capacity
> exceeds some threshold.
> For example, OpenHashSet1 (ohs1) holds [d, b, c, a]. It is spilled to file1
> sorted by hashCode, so file1 contains [a, b, c, d]. The procedure could be
> illustrated as follows:
> ohs1 [d, b, c, a] => [a, b, c, d] => file1
> ohs2 [e, f, g, a] => [a, e, f, g] => file2
> ohs3 [e, h, i, g] => [e, g, h, i] => file3
> ohs4 [j, h, a] => [a, h, j] => sortedSet
> When output, all keys with the same hashCode are put into one OpenHashSet,
> and then that OpenHashSet is iterated. The procedure could be illustrated
> as follows:
> file1-> a -> ohsA; file2 -> a -> ohsA; sortedSet -> a -> ohsA; ohsA -> a;
> file1 -> b -> ohsB; ohsB -> b;
> file1 -> c -> ohsC; ohsC -> c;
> file1 -> d -> ohsD; ohsD -> d;
> file2 -> e -> ohsE; file3 -> e -> ohsE; ohsE-> e;
> ...
> I think using the ExternalSet could avoid OOM when count(distinct).
> Comments are welcome.
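The spill-and-merge procedure quoted above could be sketched roughly as follows. This is only an illustrative in-memory sketch, not Spark code: each "file" is modeled as a hashCode-sorted Vector rather than an on-disk run, and the names ExternalSetSketch, threshold, and spill are invented for the example.

```scala
import scala.collection.mutable

// Sketch of the proposed ExternalSet idea: spill the backing set as a
// hashCode-sorted run once it exceeds `threshold`, then merge-iterate all
// runs, collecting keys that share a hashCode into one small set (ohsA in
// the example above) so duplicates across runs collapse.
class ExternalSetSketch[T](threshold: Int) {
  private val current = mutable.HashSet.empty[T]
  private val spilledRuns = mutable.ArrayBuffer.empty[Vector[T]]

  def insert(elem: T): Unit = {
    current += elem
    if (current.size >= threshold) spill()
  }

  // Sort by hashCode and "write" the run out,
  // like ohs1 [d, b, c, a] => [a, b, c, d] => file1.
  private def spill(): Unit = {
    spilledRuns += current.toVector.sortBy(_.hashCode)
    current.clear()
  }

  def iterator: Iterator[T] = {
    val runs = (spilledRuns :+ current.toVector.sortBy(_.hashCode))
      .filter(_.nonEmpty)
    val pos = Array.fill(runs.length)(0)
    new Iterator[T] {
      private var pending: Iterator[T] = Iterator.empty
      def hasNext: Boolean =
        pending.hasNext || runs.indices.exists(i => pos(i) < runs(i).length)
      def next(): T = {
        if (!pending.hasNext) {
          // Smallest hashCode still pending across all runs...
          val minHash = runs.indices
            .filter(i => pos(i) < runs(i).length)
            .map(i => runs(i)(pos(i)).hashCode).min
          // ...drain every run's block of keys with that hashCode into one
          // set, which deduplicates keys that appear in several runs.
          val group = mutable.HashSet.empty[T]
          for (i <- runs.indices) {
            while (pos(i) < runs(i).length &&
                   runs(i)(pos(i)).hashCode == minHash) {
              group += runs(i)(pos(i))
              pos(i) += 1
            }
          }
          pending = group.iterator
        }
        pending.next()
      }
    }
  }
}
```

With the inputs from the example (threshold 4, so ohs1..ohs3 spill and ohs4 stays in memory), iterating yields each distinct key exactly once, so count(distinct) is just the iterator's size.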
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]