YanTang Zhai created SPARK-4273:
-----------------------------------
Summary: Providing ExternalSet to avoid OOM when count(distinct)
Key: SPARK-4273
URL: https://issues.apache.org/jira/browse/SPARK-4273
Project: Spark
Issue Type: Improvement
Components: Spark Core, SQL
Reporter: YanTang Zhai
Priority: Minor
Some tasks may OOM during count(distinct) if they need to process many records.
CombineSetsAndCountFunction puts all records into an OpenHashSet; if it fetches
many records, it may occupy a large amount of memory.
I think a data structure ExternalSet, like ExternalAppendOnlyMap, could be
provided to spill OpenHashSet data to disk when its capacity exceeds some
threshold.
For example, OpenHashSet1 (ohs1) holds [d, b, c, a]. It is spilled to file1
sorted by hashCode, so file1 contains [a, b, c, d]. The procedure could be
illustrated as follows:
ohs1 [d, b, c, a] => [a, b, c, d] => file1
ohs2 [e, f, g, a] => [a, e, f, g] => file2
ohs3 [e, h, i, g] => [e, g, h, i] => file3
ohs4 [j, h, a] => [a, h, j] => sortedSet
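The spill step above can be sketched as follows. This is only an illustrative
sketch, not the proposed Spark implementation: scala.collection.mutable.HashSet
stands in for Spark's OpenHashSet, in-memory vectors stand in for the on-disk
spill files, and the names SpillSketch, spillThreshold, spill and insertAll are
all hypothetical.

```scala
import scala.collection.mutable

object SpillSketch {
  // Hypothetical capacity threshold; matches the 4-element sets in the example.
  val spillThreshold = 4

  // "Spill" a set's elements sorted by hashCode, as file1/file2/file3 would be.
  def spill[T](set: mutable.HashSet[T]): Vector[T] =
    set.toVector.sortBy(_.hashCode)

  // Feed records into an in-memory set, spilling a sorted run each time the
  // set reaches the threshold; the final, smaller set stays in memory
  // (the "sortedSet" in the example).
  def insertAll[T](elems: Seq[T]): (Vector[Vector[T]], mutable.HashSet[T]) = {
    val spills  = Vector.newBuilder[Vector[T]]
    val current = mutable.HashSet.empty[T]
    for (e <- elems) {
      current += e
      if (current.size >= spillThreshold) {
        spills += spill(current)
        current.clear()
      }
    }
    (spills.result(), current)
  }
}
```

Run against the record stream of the example (d, b, c, a, e, f, g, a, e, h, i,
g, j, h, a), this reproduces the three sorted spill files and the residual
in-memory set shown above.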
On output, all keys with the same hashCode are put into one OpenHashSet, and
then the iterator of that OpenHashSet is accessed. The procedure could be
illustrated as follows:
file1-> a -> ohsA; file2 -> a -> ohsA; sortedSet -> a -> ohsA; ohsA -> a;
file1 -> b -> ohsB; ohsB -> b;
file1 -> c -> ohsC; ohsC -> c;
file1 -> d -> ohsD; ohsD -> d;
file2 -> e -> ohsE; file3 -> e -> ohsE; ohsE -> e;
...
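The merge/output step can be sketched similarly, assuming each spilled "file"
is a run already sorted by hashCode. Again this is only an illustrative sketch:
a single sort stands in for the heap-based k-way merge a real implementation
would use, a LinkedHashSet stands in for the per-hashCode OpenHashSet (ohsA,
ohsB, ...), and the names MergeSketch and mergeDistinct are hypothetical.

```scala
import scala.collection.mutable

object MergeSketch {
  // Merge sorted runs; group all keys sharing a hashCode into one set so that
  // duplicates across files collapse, then emit the distinct keys.
  def mergeDistinct[T](runs: Seq[Seq[T]]): Vector[T] = {
    // Stand-in for a heap-based k-way merge of the already-sorted runs.
    val all = runs.flatten.sortBy(_.hashCode)
    val out = Vector.newBuilder[T]
    var i = 0
    while (i < all.length) {
      val h     = all(i).hashCode
      val group = mutable.LinkedHashSet.empty[T] // ohsA, ohsB, ... in the example
      while (i < all.length && all(i).hashCode == h) {
        group += all(i)
        i += 1
      }
      out ++= group // iterate the per-hashCode set, yielding each key once
    }
    out.result()
  }
}
```

Applied to file1, file2, file3 and the in-memory sortedSet from the example,
this yields each of a..j exactly once, which is what count(distinct) needs.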
I think using an ExternalSet could avoid OOM for count(distinct). Comments are
welcome.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]