Github user lemire commented on the pull request:
https://github.com/apache/spark/pull/9243#issuecomment-150668521
@rxin
There are definitely cases where attempting to use compressed bitmaps is
wasteful, for example when the universe size is small, i.e., your bitmaps
represent sets of integers from [0,n) where n is small (e.g., n=64 or n=128).
More generally, compression is not always a good idea.
The fact that you can use an uncompressed BitSet without blowing up memory
usage suggests that you might be in a scenario where compression is not
useful.
Techniques like Roaring or Concise do not make the uncompressed BitSet
obsolete. Rather, they are there to help when regular BitSets would fail you
due to excessive memory usage.
How can this happen? Suppose that you are trying to index a column
containing 1000 distinct integer values. If you try to do it with a BitSet,
each row will use 1000 bits, i.e., 125 bytes, just to index this column. If you
have 10,000 distinct values, then you use over 1kB (1250 bytes) per row just to
index this one column. And so forth.
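
To make the arithmetic above concrete, here is a minimal sketch. It assumes
the simplest possible layout, one bit per distinct value per row, rounded up to
whole bytes. The function name is hypothetical and is not part of Spark,
Roaring, or Concise.

```python
def bitset_index_bytes_per_row(distinct_values: int) -> int:
    """Bytes per row for an uncompressed bitmap index (assumed layout:
    one bit per distinct value, rounded up to a whole byte)."""
    return (distinct_values + 7) // 8


if __name__ == "__main__":
    # Small universes stay tiny; large ones grow linearly with the
    # number of distinct values.
    for n in (64, 128, 1000, 10_000):
        print(n, "distinct values:", bitset_index_bytes_per_row(n), "bytes per row")
```

With n=64 or n=128 the cost is 8 or 16 bytes per row, which leaves little for
compression to win back; at 10,000 distinct values the per-row cost is 1250
bytes, which is where compressed bitmaps start to matter.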
But if your BitSets are tiny, then compressing them could definitely be
wasteful.