Github user lemire commented on the pull request:
https://github.com/apache/spark/pull/9243#issuecomment-150716197
@rxin
Right.
If you have only 2000 bits or so in your bitmaps, then that's about 250
bytes per bitmap, maybe less in practice. Considering that a single entry in a
HashMap can use 150 bytes
(http://lemire.me/blog/2015/10/15/on-the-memory-usage-of-maps-in-java/), that's
not something that I'd worry about compressing normally. If that's your use
case and you do want to compress them, I'd be glad to have a look at the
problem, but it is not a usual scenario for compressed bitmaps. In our work, we
often consider bitmaps that range over hundreds of thousands or millions of
bits, corresponding to large tables. And we have lots of them. Then it makes
sense to worry about compressing each bitmap.
Thankfully, I can explain the bitmap formats in a few lines. Maybe this can
be helpful.
First, Java's BitSet is not a mere array of bits. Java's BitSet class does
have a form of compression: it does not store trailing zeroes. This can help
in some cases (to speed up processing and reduce memory usage). When it is
applicable, it is pretty good and does not waste memory. However, take a
BitSet and set the bit at position 1,000,000 to true and you have just over
100kB. That's over 100kB to store the position of one bit. And even if you do
not care about memory, speed becomes a factor: when you want to compute
the intersection between this BitSet and a BitSet that has the bit at position
1,000,001 set to true, you need to go through all these zeroes, whether you like
it or not. That can become very wasteful.
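To see this concretely, here is a small self-contained example using java.util.BitSet; its size() method reports the number of bits of backing storage actually allocated:

```java
import java.util.BitSet;

public class BitSetOverhead {
    public static void main(String[] args) {
        BitSet sparse = new BitSet();
        sparse.set(1_000_000); // a single bit, far from the origin

        // The backing long[] must cover every position up to 1,000,000,
        // so well over 100 kB is allocated to represent one set bit.
        System.out.println("bits allocated: " + sparse.size());
        System.out.println("approx. bytes: " + sparse.size() / 8);
    }
}
```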
For BBC (used by Oracle), WAH, EWAH (used by Apache Hive), Concise (used by
Druid in addition to Roaring)... you identify long runs of 1s or 0s and you
represent them with a marker word. So it is, essentially, run-length encoding
except that if you have a mix of 1s and 0s, you use an uncompressed word. So it
is best described as a hybrid between RLE and bitmaps. Oracle's BBC is pretty
much obsolete. Concise compresses better than most alternatives in specific
cases, but is otherwise a slower version of WAH. EWAH has lower compression
ratios, but faster processing speed. EWAH is faster because it allows some form
of "skipping".
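The word-classification idea behind these hybrids can be sketched in a few lines. This is a simplified illustration of the common principle, not any one of the actual formats (which also pack run lengths, literal counts, and, in EWAH's case, skipping metadata, into marker words):

```java
// Simplified sketch of the hybrid idea behind BBC/WAH/EWAH/Concise:
// each word of the bitmap is either a "fill" (all 0s or all 1s, which a
// run-length code can compress) or a "literal" (mixed bits, stored verbatim).
public class HybridSketch {
    enum Kind { ZERO_FILL, ONE_FILL, LITERAL }

    static Kind classify(long word) {
        if (word == 0L) return Kind.ZERO_FILL;   // run of 64 zeroes
        if (word == ~0L) return Kind.ONE_FILL;   // run of 64 ones
        return Kind.LITERAL;                     // mixed: keep uncompressed
    }

    public static void main(String[] args) {
        long[] bitmap = { 0L, 0L, ~0L, 0x00FF00FF00FF00FFL };
        for (long w : bitmap) {
            System.out.println(classify(w));
        }
    }
}
```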
There is, however, a big problem with these formats that can hurt you badly
in some cases: there is no random access. If you want to check whether a given
value is present in the set, you have to start from the beginning and
"uncompress" the whole thing. This means that if you want to intersect a small
set with a large set, you still have to uncompress the whole large set in the
worst case... That's bad news. This is not a property of run-length encoding
per se... since you can have run-length encoding *and* fast random access, but
these hybrid formats, as designed, do not support indexed access.
Roaring solves this problem. It works in the following manner. It divides
the data into chunks of 2^16 integers (e.g., [0, 2^16), [2^16, 2 × 2^16),
...). Within a chunk, it can use an uncompressed bitmap, a simple list
of integers, or a list of runs. Whatever format it uses, they all allow you to
check for the presence of any one value quickly (e.g., with a binary search).
The net result is that Roaring can compute many operations much faster than
run-length-encoded formats (WAH, EWAH, Concise...). Maybe surprisingly, it also
generally offers better compression ratios.
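The chunking scheme is easy to sketch. Below is a toy illustration (not the org.roaringbitmap API) of how a 32-bit value splits into a 16-bit chunk key and a 16-bit low part, giving direct access to the right container; the real library would store the low parts in a sorted short array, switching to an 8 kB bitmap container once a chunk exceeds 4096 values:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy sketch of Roaring's partitioning (illustration only): the high
// 16 bits of each value select a chunk, the low 16 bits are stored
// inside that chunk's container.
public class RoaringSketch {
    // chunk key -> set of low 16-bit values (a stand-in for Roaring's
    // array/bitmap/run containers)
    private final Map<Integer, TreeSet<Integer>> chunks = new TreeMap<>();

    void add(int x) {
        int high = x >>> 16;   // chunk key
        int low = x & 0xFFFF;  // position within the chunk
        chunks.computeIfAbsent(high, k -> new TreeSet<>()).add(low);
    }

    boolean contains(int x) {
        // random access: jump straight to the right chunk, no scan
        TreeSet<Integer> c = chunks.get(x >>> 16);
        return c != null && c.contains(x & 0xFFFF);
    }

    public static void main(String[] args) {
        RoaringSketch s = new RoaringSketch();
        s.add(1_000_000);
        s.add(1_000_001);
        System.out.println(s.contains(1_000_000)); // true
        System.out.println(s.contains(42));        // false
    }
}
```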
All these formats, Roaring, WAH, Concise... basically "look" very close to
an uncompressed bitmap when the data is incompressible... (Roaring will use
uncompressed bitmap containers.) However, they all add a bit of overhead. The
storage overhead is not what should worry you, though... what should worry you
is that by storing incompressible data in a compressed format, you are slowing
things down.
Think about zipping a binary file that is not compressible (e.g., a JPEG
file). Not only would the resulting file be slightly larger, but, also,
accessing the data would take quite a bit longer... needlessly.
Let me conclude with this: all these bitmap formats are quite simple
conceptually. They behave in rather predictable ways. So given the data, it is
not very difficult to benchmark them. If you do have the data, I can set up a
benchmark in no time, if you tell me what you care about.