Github user lemire commented on the pull request:
https://github.com/apache/spark/pull/9243#issuecomment-150716197
@rxin
Right.
If you have only 2000 bits or so in your bitmaps, then that's about 250
bytes per bitmap, maybe less in practice. Considering that a single entry in a
HashMap can use 150 bytes
(http://lemire.me/blog/2015/10/15/on-the-memory-usage-of-maps-in-java/), that's
not something that I'd worry about compressing normally. If that's your use
case and you do want to compress them, I'd be glad to have a look at the
problem, but it is not a usual scenario for compressed bitmaps. In our work, we
often consider bitmaps that range over hundreds of thousands or millions of
bits, corresponding to large tables. And we have lots of them. Then it makes
sense to worry about compressing each bitmap.
Thankfully, I can explain the bitmap formats in a few lines. Maybe this can
be helpful.
First, Java's BitSet is not a mere array of bits. Java's BitSet class does
have a form of compression: it does not store trailing zeroes. This can help
in some cases (to speed up processing and reduce memory usage). When it is
applicable, it is pretty good and does not waste memory. However, take a
BitSet and set the bit at position 1,000,000 to true and you have just over
100kB. That's over 100kB to store the position of one bit. And even if you do
not care about memory, speed becomes a factor: when you want to compute
the intersection between this BitSet and a BitSet that has the bit at position
1,000,001 set to true, you need to go through all these zeroes, whether you like
it or not. That can become very wasteful.
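To see this concretely, here is a small self-contained example using java.util.BitSet; its size() method reports the number of bits of backing storage actually allocated:

```java
import java.util.BitSet;

public class BitSetOverhead {
    public static void main(String[] args) {
        BitSet sparse = new BitSet();
        sparse.set(1_000_000); // a single bit, far from the origin

        // The backing long[] must cover every position up to 1,000,000,
        // so well over 100 kB is allocated to represent one set bit.
        System.out.println("bits allocated: " + sparse.size());
        System.out.println("approx. bytes: " + sparse.size() / 8);
    }
}
```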
For BBC (used by Oracle), WAH, EWAH (used by Apache Hive), Concise (used by
Druid in addition to Roaring)... you identify long runs of 1s or 0s and you
represent them with a marker word. So it is, essentially, run-length encoding
except that if you have a mix of 1s and 0s, you use an uncompressed word. So it
is best described as a hybrid between RLE and bitmaps. Oracle's BBC is pretty
much obsolete. Concise compresses better than most alternatives in specific
cases, but is otherwise a slower version of WAH. EWAH has lower compression
ratios, but faster processing speed. EWAH is faster because it allows some form
of "skipping".
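The word-classification idea behind these hybrids can be sketched in a few lines. This is a simplified illustration of the common principle, not any one of the actual formats (which also pack run lengths, literal counts, and, in EWAH's case, skipping metadata, into marker words):

```java
// Simplified sketch of the hybrid idea behind BBC/WAH/EWAH/Concise:
// each word of the bitmap is either a "fill" (all 0s or all 1s, which a
// run-length code can compress) or a "literal" (mixed bits, stored verbatim).
public class HybridSketch {
    enum Kind { ZERO_FILL, ONE_FILL, LITERAL }

    static Kind classify(long word) {
        if (word == 0L) return Kind.ZERO_FILL;   // run of 64 zeroes
        if (word == ~0L) return Kind.ONE_FILL;   // run of 64 ones
        return Kind.LITERAL;                     // mixed: keep uncompressed
    }

    public static void main(String[] args) {
        long[] bitmap = { 0L, 0L, ~0L, 0x00FF00FF00FF00FFL };
        for (long w : bitmap) {
            System.out.println(classify(w));
        }
    }
}
```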
There is, however, a big problem with these formats that can hurt you badly
in some cases: there is no random access. If you want to check whether a given
value is present in the set, you have to start from the beginning and
"uncompress" the whole thing. This means that if you want to intersect a small
set with a large set, you still have to uncompress the whole large set in the
worst case... That's bad news. This is not a property of run-length encoding
per se... since you can have run-length encoding *and* fast random access, but
these hybrid formats, as designed, do not support indexed access.
Roaring solves this problem. It works in the following manner. It divides
the data into chunks of 2^16 integers (e.g., [0, 2^16), [2^16, 2 × 2^16),
...). Within a chunk, it can use an uncompressed bitmap, a simple list
of integers, or a list of runs. Whatever format it uses, they all allow you to
check for the presence of any one value quickly (e.g., with a binary search).
The net result is that Roaring can compute many operations much faster than
run-length-encoded formats (WAH, EWAH, Concise...). Maybe surprisingly, it also
generally offers better compression ratios.
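The chunking scheme is easy to sketch. Below is a toy illustration (not the org.roaringbitmap API) of how a 32-bit value splits into a 16-bit chunk key and a 16-bit low part, giving direct access to the right container; the real library would store the low parts in a sorted short array, switching to an 8 kB bitmap container once a chunk exceeds 4096 values:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy sketch of Roaring's partitioning (illustration only): the high
// 16 bits of each value select a chunk, the low 16 bits are stored
// inside that chunk's container.
public class RoaringSketch {
    // chunk key -> set of low 16-bit values (a stand-in for Roaring's
    // array/bitmap/run containers)
    private final Map<Integer, TreeSet<Integer>> chunks = new TreeMap<>();

    void add(int x) {
        int high = x >>> 16;   // chunk key
        int low = x & 0xFFFF;  // position within the chunk
        chunks.computeIfAbsent(high, k -> new TreeSet<>()).add(low);
    }

    boolean contains(int x) {
        // random access: jump straight to the right chunk, no scan
        TreeSet<Integer> c = chunks.get(x >>> 16);
        return c != null && c.contains(x & 0xFFFF);
    }

    public static void main(String[] args) {
        RoaringSketch s = new RoaringSketch();
        s.add(1_000_000);
        s.add(1_000_001);
        System.out.println(s.contains(1_000_000)); // true
        System.out.println(s.contains(42));        // false
    }
}
```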
All these formats, Roaring, WAH, Concise... basically "look" very close to
an uncompressed bitmap when the data is incompressible... (Roaring will use
uncompressed bitmap containers.) However, they all add a bit of overhead. The
storage overhead is not what should worry you, though... what should worry you
is that by storing incompressible data in a compressed format, you are slowing
things down.
Think about zipping a binary file that is not compressible (e.g., a JPEG
file). Not only would the resulting file be slightly larger, but, also,
accessing the data would take quite a bit longer... needlessly.
Let me conclude with this: all these bitmap formats are quite simple
conceptually. They behave in rather predictable ways. So given the data, it is
not very difficult to benchmark them. If you do have the data, I can set up a
benchmark in no time, if you tell me what you care about.