GitHub user YanTangZhai opened a pull request:

    https://github.com/apache/spark/pull/3137

    [WIP] [SPARK-4273] [SQL] Providing ExternalSet to avoid OOM when 
count(distinct)

    A task may run out of memory (OOM) during count(distinct) if it has to 
process many records. CombineSetsAndCountFunction puts all records into an 
OpenHashSet, so when it fetches many records the set can occupy a large amount 
of memory.
    I think a data structure, ExternalSet, similar to ExternalAppendOnlyMap, 
could be provided to spill OpenHashSet data to disk when its size exceeds some 
threshold.
    For example, OpenHashSet1 (ohs1) holds [d, b, c, a]. It is spilled to file1 
sorted by hashCode, so file1 contains [a, b, c, d]. The procedure is as 
follows:
    ohs1 [d, b, c, a] => [a, b, c, d] => file1
    ohs2 [e, f, g, a] => [a, e, f, g] => file2
    ohs3 [e, h, i, g] => [e, g, h, i] => file3
    ohs4 [j, h, a] => [a, h, j] => sortedSet
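
The spill step above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this pull request: the object name `SpillSketch` and the helpers `spill` and `readSpill` are hypothetical, a plain `mutable.HashSet[String]` stands in for OpenHashSet, and elements are written one per line as text (so they must not contain newlines) instead of going through Spark's serializer.

```scala
import java.io.{File, PrintWriter}
import scala.collection.mutable
import scala.io.Source

object SpillSketch {
  /** Writes the set's elements to a temp file in ascending hashCode order. */
  def spill(set: mutable.HashSet[String]): File = {
    // Hypothetical file naming; a real implementation would use Spark's disk block manager.
    val file = File.createTempFile("external-set-", ".spill")
    file.deleteOnExit()
    val out = new PrintWriter(file, "UTF-8")
    try {
      // Sorting by hashCode is what lets spilled runs be merged sequentially later.
      set.toSeq.sortBy(_.hashCode).foreach(s => out.println(s))
    } finally {
      out.close()
    }
    file
  }

  /** Reads a spilled run back in the order it was written. */
  def readSpill(file: File): List[String] = {
    val src = Source.fromFile(file, "UTF-8")
    try src.getLines().toList finally src.close()
  }
}
```

With the single-character keys from the example, `SpillSketch.readSpill(SpillSketch.spill(mutable.HashSet("d", "b", "c", "a")))` yields `List("a", "b", "c", "d")`, because the hashCode of a one-character string is its character code.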
    On output, all keys with the same hashCode are put into one OpenHashSet, 
and the iterator of that OpenHashSet is then consumed. The procedure is as 
follows:
    file1 -> a -> ohsA; file2 -> a -> ohsA; sortedSet -> a -> ohsA; ohsA -> a;
    file1 -> b -> ohsB; ohsB -> b;
    file1 -> c -> ohsC; ohsC -> c;
    file1 -> d -> ohsD; ohsD -> d;
    file2 -> e -> ohsE; file3 -> e -> ohsE; ohsE -> e;
    ...
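
The output step can be sketched as a k-way merge over runs that are each sorted by hashCode. Again, this is a minimal sketch under assumptions, not the implementation in this pull request: `MergeSketch` and `mergeDistinct` are hypothetical names, the runs are plain in-memory iterators standing in for the spill files and the in-memory sortedSet, and a `LinkedHashSet` stands in for the per-hashCode OpenHashSet.

```scala
import scala.collection.mutable

object MergeSketch {
  /** Merges runs that are each sorted by hashCode, yielding every distinct element once. */
  def mergeDistinct[T](sortedRuns: Seq[Iterator[T]]): Iterator[T] = {
    // Min-heap of runs, keyed on the hashCode of each run's next element.
    val byHeadHash = Ordering.by[BufferedIterator[T], Int](_.head.hashCode).reverse
    val heap = new mutable.PriorityQueue[BufferedIterator[T]]()(byHeadHash)
    sortedRuns.map(_.buffered).filter(_.hasNext).foreach(r => heap.enqueue(r))

    new Iterator[T] {
      // Elements of the current hashCode group that have not been emitted yet.
      private var group: Iterator[T] = Iterator.empty

      def hasNext: Boolean = group.hasNext || heap.nonEmpty

      def next(): T = {
        if (!group.hasNext) {
          val minHash = heap.head.head.hashCode
          // Gather every element with this hashCode from all runs; the set removes
          // duplicates while still keeping distinct keys whose hashCodes collide.
          val sameHash = mutable.LinkedHashSet.empty[T]
          while (heap.nonEmpty && heap.head.head.hashCode == minHash) {
            val run = heap.dequeue()
            while (run.hasNext && run.head.hashCode == minHash) sameHash += run.next()
            // A run is only re-inserted after its head has changed, so the heap stays valid.
            if (run.hasNext) heap.enqueue(run)
          }
          group = sameHash.iterator
        }
        group.next()
      }
    }
  }
}
```

For the four runs in the example, `[a, b, c, d]`, `[a, e, f, g]`, `[e, g, h, i]` and `[a, h, j]`, the merged iterator yields `a` through `j` once each, so the distinct count is 10. Only one hashCode group is held in memory at a time, which is the point of the proposal.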
    I think using ExternalSet could avoid OOM during count(distinct). 
Comments are welcome.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/YanTangZhai/spark ExternalAggregate

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3137.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3137
    
----
commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai <[email protected]>
Date:   2014-08-06T13:07:08Z

    Merge pull request #1 from apache/master
    
    update

commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai <[email protected]>
Date:   2014-08-20T13:14:08Z

    Merge pull request #3 from apache/master
    
    Update

commit 8a0010691b669495b4c327cf83124cabb7da1405
Author: YanTangZhai <[email protected]>
Date:   2014-09-12T06:54:58Z

    Merge pull request #6 from apache/master
    
    Update

commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3
Author: YanTangZhai <[email protected]>
Date:   2014-09-16T12:03:22Z

    Merge pull request #7 from apache/master
    
    Update

commit 76d40277d51f709247df1d3734093bf2c047737d
Author: YanTangZhai <[email protected]>
Date:   2014-10-20T12:52:22Z

    Merge pull request #8 from apache/master
    
    update

commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a
Author: YanTangZhai <[email protected]>
Date:   2014-11-04T09:00:31Z

    Merge pull request #9 from apache/master
    
    Update

commit eecb499bb10b21d648ae9e6c0282fafcde111994
Author: yantangzhai <[email protected]>
Date:   2014-11-06T12:57:29Z

    A method to avoid OOM when count(distinct) by providing ExternalSet

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
