[ 
https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999142#comment-14999142
 ] 

Daniel Lemire edited comment on SPARK-11583 at 11/10/15 8:42 PM:
-----------------------------------------------------------------

[~Qin Yao] wrote: 

"Roaringbitmap is same as the BitSet now we use in HiglyCompressedMapStatus, 
but take 20% memory usage more than BitSet. They both don't be compressed 
in-memory. According to the annotations of the former 
Roaring-HiglyCompressedMapStatus, it can be compressed during serialization not 
in-memory."


I think that's a misunderstanding.

Lucene and Apache Kylin use Roaring for in-memory bitmaps, and it saves a ton 
of memory. Druid uses them for memory-mapped bitmaps, and it compresses well.

If you do flips, then it is possible that Roaring might end up being 
inefficient. Lucene has one approach to solve this matter and, in 
RoaringBitmap, we offer the "runOptimize" function. But, generally, you should 
expect Roaring bitmaps to compress rather well.

Please get in touch with examples if you want, we could discuss the matter 
further.


was (Author: lemire):
[~Qin Yao] wrote: 

"Roaringbitmap is same as the BitSet now we use in HiglyCompressedMapStatus, 
but take 20% memory usage more than BitSet. They both don't be compressed 
in-memory. According to the annotations of the former 
Roaring-HiglyCompressedMapStatus, it can be compressed during serialization not 
in-memory."


I think that's a misunderstanding.

Lucene and Apache Kylin use Roaring for in-memory bitmaps, and it saves a ton 
of memory. Druid uses them for memory-mapped bitmaps, and it compresses well.

If you do flips, then it is possible that Roaring might end up being 
inefficient. Lucene has one approach to that, in RoaringBitmap, we offer the 
"runOptimize" function.

> Make MapStatus use less memory uage
> -----------------------------------
>
>                 Key: SPARK-11583
>                 URL: https://issues.apache.org/jira/browse/SPARK-11583
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler, Spark Core
>            Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I 
> said, using BitSet can save ≈20% memory usage compared to RoaringBitMap. 
> For a spark job contains quite a lot of tasks, 20% seems a drop in the ocean. 
> Essentially, BitSet uses long[]. For example a BitSet[200k] = long[3125].
> So if we use a HashSet[Int] to store reduceId (when non-empty blocks are 
> dense,use reduceId of empty blocks; when sparse, use non-empty ones). 
> For dense cases: if HashSet[Int](numNonEmptyBlocks).size <   
> BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks
> For sparse cases: if HashSet[Int](numEmptyBlocks).size <   
> BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks
> sparse case, 299/300 are empty
> sc.makeRDD(1 to 30000, 3000).groupBy(x=>x).top(5)
> dense case,  no block is empty
> sc.makeRDD(1 to 9000000, 3000).groupBy(x=>x).top(5)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to