GitHub user yaooqinn opened a pull request:

    https://github.com/apache/spark/pull/9559

    [SPARK-11583] Make MapStatus use less memory usage

    In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, I
    noted that using BitSet saves about 20% of memory compared to RoaringBitmap.
    For a Spark job with a very large number of tasks, a 20% saving is still a
    drop in the ocean.
    Essentially, BitSet is backed by a long[]. For example, a BitSet over 200k
    blocks needs a long[3125].
    Instead, we can use a HashSet[Int] to store reduce IDs: when non-empty
    blocks are dense, store the reduce IDs of the empty blocks; when they are
    sparse, store the reduce IDs of the non-empty blocks.
    For sparse cases: if HashSet[Int](numNonEmptyBlocks).size <
    BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks.
    For dense cases: if HashSet[Int](numEmptyBlocks).size <
    BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks.
    Sparse case, 299/300 of the blocks are empty:
    sc.makeRDD(1 to 30000, 3000).groupBy(x => x).top(5)
    Dense case, no block is empty:
    sc.makeRDD(1 to 9000000, 3000).groupBy(x => x).top(5)
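
The sizing argument above can be sketched roughly as follows. This is an illustrative sketch, not code from the patch: the object name MapStatusSizing, its methods, and the Tracking cases are hypothetical, and the byte estimates assume 8 bytes per bitmap word and 4 bytes per stored reduce ID, ignoring object headers, boxing, and hash-table overhead (a real HashSet[Int] costs more per entry).

```scala
// Hypothetical sketch: none of these names come from the pull request.
object MapStatusSizing {
  // A bitmap over numBlocks ids is backed by ceil(numBlocks / 64) longs.
  def bitSetWords(numBlocks: Int): Int = (numBlocks + 63) / 64

  // Assumed payload sizes in bytes (headers and hashing overhead ignored).
  def bitSetBytes(numBlocks: Int): Long = bitSetWords(numBlocks).toLong * 8L
  def idSetBytes(numIds: Int): Long = numIds.toLong * 4L

  sealed trait Tracking
  case object TrackNonEmpty extends Tracking // store ids of non-empty blocks
  case object TrackEmpty extends Tracking    // store ids of empty blocks
  case object UseBitSet extends Tracking     // fall back to a plain bitmap

  // Pick the smaller id set, and use it only if it beats the bitmap.
  def choose(blockSizes: Array[Long]): Tracking = {
    val total = blockSizes.length
    val nonEmpty = blockSizes.count(_ > 0L)
    val empty = total - nonEmpty
    val bitmap = bitSetBytes(total)
    if (nonEmpty <= empty && idSetBytes(nonEmpty) < bitmap) TrackNonEmpty
    else if (empty < nonEmpty && idSetBytes(empty) < bitmap) TrackEmpty
    else UseBitSet
  }
}
```

Under these assumptions, a map output with 10 non-empty blocks out of 3000 (the sparse job above) would track the non-empty IDs, one with no empty blocks (the dense job) would track the empty IDs, and an even split would fall back to the bitmap.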

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yaooqinn/spark mapstatus-smaller

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9559.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9559
    
----
commit 79353eb4e1c8ef69278681a9c6eaad7b359c4270
Author: YAOQIN <yao...@huawei.com>
Date:   2015-11-07T07:04:13Z

    map status smaller

commit 492adeb58871c179bf5f6b553111c6f2a2b3ee3b
Author: Kent Yao <yaooq...@hotmail.com>
Date:   2015-11-09T02:21:51Z

    Let MapStatus be smaller in sparse/tense cases

commit cb4bce57fdfad98a64821870025b8173447cf882
Author: Kent Yao <yaooq...@hotmail.com>
Date:   2015-11-09T03:22:03Z

    Let MapStatus be smaller in sparse/tense cases 1

----

