[ https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999142#comment-14999142 ]
Daniel Lemire edited comment on SPARK-11583 at 11/10/15 8:42 PM:
-----------------------------------------------------------------

[~Qin Yao] wrote: "Roaringbitmap is same as the BitSet now we use in HiglyCompressedMapStatus, but take 20% memory usage more than BitSet. They both don't be compressed in-memory. According to the annotations of the former Roaring-HiglyCompressedMapStatus, it can be compressed during serialization not in-memory."

I think that's a misunderstanding. Lucene and Apache Kylin use Roaring for in-memory bitmaps, and it saves a ton of memory. Druid uses them for memory-mapped bitmaps, and it compresses well.

If you do flips, then it is possible that Roaring might end up being inefficient. Lucene has one approach to solve this matter and, in RoaringBitmap, we offer the "runOptimize" function.

But, generally, you should expect Roaring bitmaps to compress rather well. Please get in touch with examples if you want, we could discuss the matter further.


was (Author: lemire):

[~Qin Yao] wrote: "Roaringbitmap is same as the BitSet now we use in HiglyCompressedMapStatus, but take 20% memory usage more than BitSet. They both don't be compressed in-memory. According to the annotations of the former Roaring-HiglyCompressedMapStatus, it can be compressed during serialization not in-memory."

I think that's a misunderstanding. Lucene and Apache Kylin use Roaring for in-memory bitmaps, and it saves a ton of memory. Druid uses them for memory-mapped bitmaps, and it compresses well.

If you do flips, then it is possible that Roaring might end up being inefficient. Lucene has one approach to that, in RoaringBitmap, we offer the "runOptimize" function.
> Make MapStatus use less memory uage
> -----------------------------------
>
>                 Key: SPARK-11583
>                 URL: https://issues.apache.org/jira/browse/SPARK-11583
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler, Spark Core
>            Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I
> said, using BitSet can save ≈20% memory usage compared to RoaringBitMap.
> For a spark job contains quite a lot of tasks, 20% seems a drop in the ocean.
> Essentially, BitSet uses long[]. For example a BitSet[200k] = long[3125].
> So if we use a HashSet[Int] to store reduceId (when non-empty blocks are
> dense,use reduceId of empty blocks; when sparse, use non-empty ones).
> For dense cases: if HashSet[Int](numNonEmptyBlocks).size <
> BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks
> For sparse cases: if HashSet[Int](numEmptyBlocks).size <
> BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks
> sparse case, 299/300 are empty
> sc.makeRDD(1 to 30000, 3000).groupBy(x=>x).top(5)
> dense case, no block is empty
> sc.makeRDD(1 to 9000000, 3000).groupBy(x=>x).top(5)
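
To make the size comparison in the issue description concrete, here is a minimal sketch of the arithmetic behind "BitSet[200k] = long[3125]" and of a rule that picks whichever representation is estimated to be smaller. It is not Spark's implementation: the object name, the Choice cases, and especially the 32-bytes-per-entry figure for a HashSet[Int] are assumptions made for illustration. Real per-entry overhead depends on the JVM and the set implementation.

```scala
object MapStatusSizeSketch {
  // A long[]-backed bit set needs one 64-bit word per 64 bits, rounded up.
  def bitSetWords(numBits: Int): Int = (numBits + 63) / 64

  // Payload size of that long[] in bytes (object headers ignored).
  def bitSetBytes(numBits: Int): Long = bitSetWords(numBits).toLong * 8L

  // ASSUMPTION: rough cost of one entry in a hash set of Ints, in bytes.
  // This is an illustrative constant, not a measured value.
  val assumedBytesPerHashEntry: Long = 32L

  def hashSetBytes(numEntries: Int): Long =
    numEntries.toLong * assumedBytesPerHashEntry

  // Hypothetical names for the three possible representations.
  sealed trait Choice
  case object TrackEmptyIds extends Choice    // store reduce ids of empty blocks
  case object TrackNonEmptyIds extends Choice // store reduce ids of non-empty blocks
  case object UseBitSet extends Choice        // fall back to one bit per block

  // Store whichever id set is smaller, unless a bit set is estimated cheaper.
  def choose(totalBlocks: Int, numEmpty: Int): Choice = {
    val numNonEmpty = totalBlocks - numEmpty
    val smaller = math.min(numEmpty, numNonEmpty)
    if (hashSetBytes(smaller) >= bitSetBytes(totalBlocks)) UseBitSet
    else if (numEmpty <= numNonEmpty) TrackEmptyIds
    else TrackNonEmptyIds
  }
}
```

Under these assumptions, the sparse example above (3000 blocks, 2990 of them empty) selects TrackNonEmptyIds, the dense example (no empty blocks) selects TrackEmptyIds with an empty set, and a half-empty map output falls back to UseBitSet. The crossover point moves with the assumed per-entry cost, so the constant should be measured before drawing conclusions.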