Hi,

I'm working on a page-based bitmap index which uses sparse files to RLE
compress regions of the bitmap.  I essentially create a linked list on disk
in a very large sparse file, then allocate space in the file for the data
and for each index.  Because nodes in the list are at fixed locations, it is
as if there are pointers directly to each node.  Adding space between
nodes is therefore O(1), and so is traversing the list, since it can be
entered at predefined points and reached with only one or two seeks to the
necessary location.  This is cheap on SSD.  Runs of zeros one page or longer
can be represented using only sparse space: no bits are stored on disk for
runs of zeros >= 1 page.  Runs of ones require either one or two pages (they
are 4k pages).  In addition, if a page goes from having no data to being
full, or from being full to having no data, a hole can usually be punched in
the filesystem so that the SSD can better manage space.
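
To make the hole punching concrete, here is a rough sketch of that step in
C, assuming Linux and a filesystem that supports FALLOC_FL_PUNCH_HOLE
(ext4, XFS, btrfs); the helper name punch_page is just mine for
illustration:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096

/* Deallocate the 4k page at page_no, turning it back into sparse space.
 * FALLOC_FL_KEEP_SIZE leaves the logical file size unchanged, so the
 * fixed offsets of the other nodes in the on-disk list are preserved. */
int punch_page(int fd, uint64_t page_no)
{
    off_t off = (off_t)page_no * PAGE_SIZE;
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  off, PAGE_SIZE) != 0) {
        perror("fallocate");
        return -1;
    }
    return 0;
}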

Rather than use binning, there is a b+tree index which keeps track of the
distinct values in the index.  If there is only one value, the RID is stored
directly in the b+tree, so creating the bitmap is not necessary.  If there
are more than one and at most 512 distinct values, a list of RIDs is stored
on the first index page, and bitmap pages are created on the fly until all
the slots fill and the real index is materialized.  Do you think binning is
a better approach than this?
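
Roughly, I picture the first index page looking like this; the field names,
the per-value framing, and the exact layout are just my assumptions for the
example (on a real 4k page the header would eat a few of the 512 slots):

#include <stdint.h>

#define RID_SLOTS 512

struct first_index_page {
    uint32_t crc32;            /* CRC32 of the page contents            */
    uint32_t n_rids;           /* how many slots are currently in use   */
    uint64_t rids[RID_SLOTS];  /* RID list kept until the slots fill up */
};

/* Returns 1 if the RID fit in the slot list, 0 if the caller must now
 * materialize the real bitmap index. */
int add_rid(struct first_index_page *p, uint64_t rid)
{
    if (p->n_rids < RID_SLOTS) {
        p->rids[p->n_rids++] = rid;
        return 1;
    }
    return 0;
}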

I only mention this because I would like to use WAH for compression of
regions that are not sparse, and I plan to use data compression via what I
think is a unique method, one which could benefit FastBit.  Also, I haven't
looked closely at FastBit in a while.  Is there an easy way to use a FastBit
WAH index as a bit vector rather than having to store the data into FastBit
partitions?

For me, each column is stored as pages.  Each page has a CRC32, and on index
pages there is also a mask of the page which marks the sections that are
empty and the sections that contain some data.  This can make page
comparisons really fast.
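
Concretely, the per-page metadata looks roughly like this; the 64-byte
section granularity and the comparison routine are just choices I made for
the example:

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE    4096
#define SECTION_SIZE 64   /* 4096/64 = 64 sections, one mask bit each */

struct page_header {
    uint32_t crc32;          /* checksum of the page payload                 */
    uint64_t section_mask;   /* bit set = that 64-byte section contains data */
};

/* Fast page comparison: if the masks or CRCs differ the pages differ;
 * only when both match are the occupied sections compared byte by byte. */
int pages_equal(const struct page_header *ha, const uint8_t *a,
                const struct page_header *hb, const uint8_t *b)
{
    if (ha->section_mask != hb->section_mask || ha->crc32 != hb->crc32)
        return 0;
    for (int i = 0; i < PAGE_SIZE / SECTION_SIZE; i++) {
        if (!(ha->section_mask & (1ULL << i)))
            continue;  /* both sections empty, nothing to compare */
        if (memcmp(a + i * SECTION_SIZE, b + i * SECTION_SIZE,
                   SECTION_SIZE) != 0)
            return 0;
    }
    return 1;
}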

The compression works by taking a 1MB extent and compressing it down with
PFOR, FPC, RLE, or other forms of compression.  Let's say the 1MB extent
shrinks to 50K.  At the beginning of the 1MB extent, the 50K is written and
the rest of the space is reserved as sparse space.  You can seek into any
extent and decompress the data without having to search forward through an
RLE-compressed structure.  Since FastBit does not compress data (at least I
don't think it does), such a mechanism might be useful to you.  The same
technique can be used for WAH, though I do not know if it would benefit:
reserve a chunk of sparse space on disk (1MB), WAH-compress 1MB of data,
then write the smaller result to disk.  You can then still seek into that
1MB extent and get the data, but the size on disk is significantly smaller.
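
In rough C terms, the extent scheme looks like this; the compression call
itself is elided, and the small per-extent header is my own invention for
the example:

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define EXTENT_SIZE (1024 * 1024)

struct extent_header {
    uint32_t compressed_len;  /* bytes of compressed data that follow      */
    uint32_t codec;           /* which codec was used: PFOR, FPC, RLE, WAH */
};

/* Write one compressed extent at its fixed offset.  Only the header plus
 * compressed_len bytes get allocated; the rest of the 1MB slot is never
 * written, so it stays a hole in the sparse file. */
int write_extent(int fd, uint64_t extent_no,
                 const struct extent_header *hdr, const void *data)
{
    off_t base = (off_t)extent_no * EXTENT_SIZE;
    if (pwrite(fd, hdr, sizeof *hdr, base) != (ssize_t)sizeof *hdr)
        return -1;
    if (pwrite(fd, data, hdr->compressed_len, base + (off_t)sizeof *hdr)
        != (ssize_t)hdr->compressed_len)
        return -1;
    return 0;
}

/* Random access: seek straight to extent_no * 1MB, read the header, then
 * read (and decompress) just that extent; no forward scan through an RLE
 * stream is needed. */
ssize_t read_extent(int fd, uint64_t extent_no,
                    struct extent_header *hdr, void *buf, size_t buflen)
{
    off_t base = (off_t)extent_no * EXTENT_SIZE;
    if (pread(fd, hdr, sizeof *hdr, base) != (ssize_t)sizeof *hdr)
        return -1;
    if (hdr->compressed_len > buflen)
        return -1;
    return pread(fd, buf, hdr->compressed_len, base + (off_t)sizeof *hdr);
}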

Anyway, if you think this is a good idea, please let me know.

Regards,