YanTang Zhai created SPARK-4273:
-----------------------------------

             Summary: Providing ExternalSet to avoid OOM when count(distinct)
                 Key: SPARK-4273
                 URL: https://issues.apache.org/jira/browse/SPARK-4273
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, SQL
            Reporter: YanTang Zhai
            Priority: Minor


Some tasks may OOM when running count(distinct) if they need to process many 
records. CombineSetsAndCountFunction puts all records into an OpenHashSet; if 
it fetches many records, it may occupy a large amount of memory.
I think a data structure ExternalSet, analogous to ExternalAppendOnlyMap, could 
be provided to spill OpenHashSet data to disk when its capacity exceeds some 
threshold.
For example, OpenHashSet1 (ohs1) holds [d, b, c, a]. It is spilled to file1 
sorted by hashCode, so file1 contains [a, b, c, d]. The procedure could be 
illustrated as follows:
ohs1 [d, b, c, a] => [a, b, c, d] => file1
ohs2 [e, f, g, a] => [a, e, f, g] => file2
ohs3 [e, h, i, g] => [e, g, h, i] => file3
ohs4 [j, h, a] => [a, h, j] => sortedSet
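The spill step above could be sketched as follows. This is an illustrative sketch only, not Spark code: the `spill` helper is hypothetical, integers stand in for records (in CPython, hash(n) == n for small ints, so the ordering is deterministic), and a plain list stands in for an on-disk spill file.

```python
def spill(open_hash_set):
    """Hypothetical spill step: return the set's keys sorted by hash code,
    as they would be written sequentially to a spill file."""
    return sorted(open_hash_set, key=hash)

# ohs1 holds keys in arbitrary in-memory order...
ohs1 = {4, 2, 3, 1}
# ...but the spill "file" stores them in hash order.
file1 = spill(ohs1)
print(file1)  # [1, 2, 3, 4]
```

Sorting by hashCode at spill time is what later lets all spilled files be merged with a streaming k-way merge instead of reloading them whole.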
When producing output, all keys with the same hashCode are put into one small 
OpenHashSet, and then the iterator of that OpenHashSet is traversed. The 
procedure could be illustrated as follows:
file1 -> a -> ohsA; file2 -> a -> ohsA; sortedSet -> a -> ohsA; ohsA -> a;
file1 -> b -> ohsB; ohsB -> b;
file1 -> c -> ohsC; ohsC -> c;
file1 -> d -> ohsD; ohsD -> d;
file2 -> e -> ohsE; file3 -> e -> ohsE; ohsE -> e;
...
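The output phase above could be sketched as a k-way merge over the hash-sorted runs, bucketing equal hash codes into one small set so duplicates collapse. Again a hypothetical sketch, not Spark code: `merge_distinct` is an invented name, lists stand in for spill files, and small ints make hash order deterministic.

```python
import heapq

def merge_distinct(sorted_runs):
    """K-way merge of hash-sorted runs (spilled files plus the in-memory
    sorted set). Keys sharing a hash code are collected into one small
    set (the ohsA/ohsB/... of the example) so duplicates collapse, then
    each distinct key is emitted exactly once."""
    merged = heapq.merge(*sorted_runs, key=hash)
    current_hash, bucket = None, set()
    for key in merged:
        h = hash(key)
        if h != current_hash and bucket:
            # Hash code changed: this bucket is complete, emit it.
            yield from bucket
            bucket = set()
        current_hash = h
        bucket.add(key)
    if bucket:
        yield from bucket

file1 = [1, 2, 3, 4]      # spilled from ohs1, hash-sorted
file2 = [1, 5, 6, 7]      # spilled from ohs2, hash-sorted
sorted_set = [1, 8, 10]   # still in memory
print(sorted(merge_distinct([file1, file2, sorted_set])))
# [1, 2, 3, 4, 5, 6, 7, 8, 10]
```

Only one hash bucket is materialized at a time, so memory use stays bounded by the largest group of hash-colliding keys rather than by the full distinct set.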

I think using an ExternalSet could avoid OOM when running count(distinct). 
Comments are welcome.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
