GitHub user yaooqinn opened a pull request:

    https://github.com/apache/spark/pull/9661

    [SPARK-11583] [Core]MapStatus Using RoaringBitmap More Properly

    1. test cases
    1.1 sparse case: for each task, 10 blocks contain data; the others don't
    sc.makeRDD(1 to 40950, 4095).groupBy(x=>x).top(5)
    1.2 dense case: for each task, most blocks contain data; a few don't
    1.2.1 full
    sc.makeRDD(1 to 16769025, 4095).groupBy(x=>x).top(5)
    1.2.2 very dense: about 95 empty blocks
    sc.makeRDD(1 to 16380000, 4095).groupBy(x=>x).top(5)
    1.3 test tool
    jmap -dump:format=b,file=heap.bin <pid>
    1.4 test branches: branch-1.5, master
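
The block counts in the test cases above can be checked without a cluster. The sketch below is a stand-alone model, not Spark code: it assumes the range is cut into 4095 equal contiguous slices (one per map task) and that an Int key k lands in reduce partition k % 4095. The object and method names are made up for illustration.

```scala
// Hypothetical stand-alone model of the three test cases; no Spark required.
object BlockCountSketch {
  val numPartitions = 4095 // map tasks and (assumed) reduce partitions

  /** Non-empty shuffle blocks written by map task `task`, assuming 1..n is
    * split into equal contiguous slices and key k goes to reducer k % numPartitions. */
  def nonEmptyBlocks(n: Int, task: Int): Int = {
    val start = (task.toLong * n / numPartitions).toInt + 1 // first value in the slice
    val end = ((task + 1).toLong * n / numPartitions).toInt // last value in the slice
    (start to end).map(_ % numPartitions).distinct.size
  }

  def main(args: Array[String]): Unit = {
    val cases = Seq("sparse" -> 40950, "full" -> 16769025, "very dense" -> 16380000)
    for ((name, n) <- cases) {
      val nonEmpty = nonEmptyBlocks(n, task = 0)
      println(s"$name: $nonEmpty non-empty, ${numPartitions - nonEmpty} empty")
    }
  }
}
```

Under these assumptions task 0 gets 10, 4095 and 4000 non-empty blocks respectively, which matches the "10 blocks", "full" and "about 95 empty blocks" descriptions.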
    2. memory usage
    2.1 RoaringBitmap--sparse
    Class Name | Objects | Shallow Heap | Retained Heap
    ---------------------------------------------------------------------------------------------
    org.apache.spark.scheduler.HighlyCompressedMapStatus | 4,095 | 131,040 | >= 34,135,920
    ---------------------------------------------------------------------------------------------
    my explanation: 4095 * short[4095-10] = 4095 * 16 * 4085 / 8 = 33,456,150 bytes, close to the measured 34,135,920
    2.2.1 RoaringBitmap--full
    Class Name | Objects | Shallow Heap | Retained Heap
    ---------------------------------------------------------------------------------------------
    org.apache.spark.scheduler.HighlyCompressedMapStatus | 4,095 | 131,040 | >= 360,360
    ---------------------------------------------------------------------------------------------
    my explanation: RoaringBitmap(0), i.e. no block is empty, so the bitmap of empty blocks has no entries
    2.2.2 RoaringBitmap--very dense
    Class Name | Objects | Shallow Heap | Retained Heap
    ---------------------------------------------------------------------------------------------
    org.apache.spark.scheduler.HighlyCompressedMapStatus | 4,095 | 131,040 | >= 1,441,440
    ---------------------------------------------------------------------------------------------
    my explanation: 4095 * short[95] = 4095 * 16 * 95 / 8 = 778,050 (+ others = 1,441,440)
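
The RoaringBitmap estimates above all rest on one assumption: each status stores one 16-bit short (2 bytes) per marked empty block. A minimal sketch of that arithmetic follows; the object and method names are illustrative only, and the result covers the short arrays alone, not object headers or other fields.

```scala
// Hypothetical helper reproducing the short-array arithmetic from the explanations.
object RoaringEstimateSketch {
  val numMapStatuses = 4095L
  val bytesPerEntry = 2L // assumption: one 16-bit short per marked block

  /** Payload bytes over all map statuses when each one marks `marked` blocks. */
  def payloadBytes(marked: Long): Long = numMapStatuses * marked * bytesPerEntry

  def main(args: Array[String]): Unit = {
    println(payloadBytes(4095 - 10)) // sparse: 4085 empty blocks marked -> 33456150
    println(payloadBytes(95))        // very dense: 95 empty blocks marked -> 778050
    println(payloadBytes(0))         // full: nothing marked -> 0
  }
}
```

The gap between these figures and the measured retained heap is presumably per-object overhead (the "others" in the explanations).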
    2.3 BitSet--sparse
    Class Name | Objects | Shallow Heap | Retained Heap
    ---------------------------------------------------------------------------------------------
    org.apache.spark.scheduler.HighlyCompressedMapStatus | 4,095 | 131,040 | >= 2,391,480
    ---------------------------------------------------------------------------------------------
    my explanation: 4095 * 4095 = 16,769,025 bits, about 2.1 MB (+ others = 2,391,480)
    2.4 BitSet--full
    Class Name | Objects | Shallow Heap | Retained Heap
    ---------------------------------------------------------------------------------------------
    org.apache.spark.scheduler.HighlyCompressedMapStatus | 4,095 | 131,040 | >= 2,391,480
    ---------------------------------------------------------------------------------------------
    my explanation: same as above
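
The BitSet figure does not depend on how many blocks are empty, which is why the sparse and full cases match. A rough sketch of the estimate, assuming a fixed-size bit set backed by 64-bit words (the names here are illustrative, and the layout is an assumption rather than something stated in the heap dump):

```scala
// Hypothetical estimate for a word-backed bit set; payload only, no object headers.
object BitSetEstimateSketch {
  val numMapStatuses = 4095L
  val numBlocks = 4095L

  /** Bytes for one bit set of `bits` bits stored in 64-bit words. */
  def bitSetBytes(bits: Long): Long = ((bits + 63) / 64) * 8

  def main(args: Array[String]): Unit = {
    val perStatus = bitSetBytes(numBlocks) // 64 words * 8 bytes = 512
    println(perStatus * numMapStatuses)    // 2096640
  }
}
```

That leaves roughly 72 bytes per status of the measured 2,391,480 to account for as other overhead.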
    3. conclusion
    memory usage:
    RoaringBitmap--full < RoaringBitmap--very dense < BitSet--full = BitSet--sparse < RoaringBitmap--sparse
    
    For the RoaringBitmap--sparse case specifically, use the RoaringBitmap to
mark the non-empty blocks instead of the empty ones. That way, memory usage
stays low in all cases.
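
One way to read the idea above is "track whichever side is smaller". The sketch below is only an illustration of that reading, not the code in this pull request: a plain Set[Int] stands in for the RoaringBitmap, and Tracked, track and isEmptyBlock are hypothetical names.

```scala
// Hypothetical sketch: mark the smaller of the two sets (empty vs non-empty blocks).
object SmallerSideSketch {
  /** `indices` holds empty blocks when `tracksEmpty` is true, non-empty ones otherwise. */
  final case class Tracked(indices: Set[Int], tracksEmpty: Boolean) {
    def isEmptyBlock(i: Int): Boolean =
      if (tracksEmpty) indices.contains(i) else !indices.contains(i)
  }

  def track(blockSizes: Array[Long]): Tracked = {
    // Split block ids by whether the block holds any data.
    val (empty, nonEmpty) = blockSizes.indices.partition(i => blockSizes(i) == 0L)
    if (empty.size <= nonEmpty.size) Tracked(empty.toSet, tracksEmpty = true)
    else Tracked(nonEmpty.toSet, tracksEmpty = false)
  }

  def main(args: Array[String]): Unit = {
    // Sparse case: only the first 10 of 4095 blocks contain data.
    val sparse = Array.tabulate(4095)(i => if (i < 10) 1L else 0L)
    val t = track(sparse)
    println(s"tracksEmpty=${t.tracksEmpty}, entries=${t.indices.size}")
  }
}
```

With the sparse input, only 10 entries are stored per status instead of 4085, while the very dense and full cases keep marking the (few or zero) empty blocks as before.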
    
    This is better than my former PR: https://github.com/apache/spark/pull/9559

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yaooqinn/spark mapstatus-roaring

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9661.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9661
    
----
commit 75c3209c4bba8902eb0a4c1649864106f015c39f
Author: Kent Yao <[email protected]>
Date:   2015-11-12T11:25:02Z

    MapStatus Using RoaringBitmap More Properly

----

