Hi, Justin,

Looks like you are on to a great idea. In general, the idea of mixing RIDs
with bitmaps could be thought of as a variant of PLWAH compression
<https://www.google.com/webhp?q=PLWAH>. The use of sparse pages might not
have been done before; however, there have been a number of attempts to
build per-page indexes. Here is one example:
<http://link.springer.com/chapter/10.1007%2F978-3-642-41221-9_4>.
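For concreteness, the PLWAH trick is roughly this: when a fill word is
followed by a word containing only one (or a few) set bits, the positions of
those bits are packed into otherwise-unused bits of the fill word, so the
lone RIDs ride along inside the bitmap instead of costing a literal word
each. A minimal sketch for 32-bit words (the field widths below are my own
illustrative choices, not the exact PLWAH or FastBit layout):

    #include <cstdint>

    // One PLWAH-style fill word (illustrative layout):
    //   bit 31      -- 1 = fill word, 0 = literal word
    //   bit 30      -- fill value (run of all zeros or all ones)
    //   bits 25..29 -- 1-based position of a single dirty bit in the word
    //                  that follows the run; 0 means "no dirty bit"
    //   bits 0..24  -- run length, counted in 31-bit literal words
    static uint32_t plwah_fill(bool ones, uint32_t run_len, uint32_t dirty_pos)
    {
        return (1u << 31)
             | (uint32_t(ones) << 30)
             | ((dirty_pos & 0x1Fu) << 25)
             | (run_len & 0x01FFFFFFu);
    }

    // A run of 1000 zero words followed by a word whose only set bit is
    // bit 17 collapses into one word; the RID of that bit is recoverable
    // from run_len and dirty_pos, which is the sense in which RIDs get
    // mixed into the bitmap.
    uint32_t w = plwah_fill(false, 1000, 17 + 1);

Plain WAH would spend a fill word plus a literal word on the same pattern.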
Good luck.

John

On 6/14/15 11:53 PM, Justin Swanhart wrote:
> Hi,
>
> I'm working on a page-based bitmap index which uses sparse files to
> RLE-compress regions of the bitmap. I essentially create a linked
> list on disk in a very large sparse file, then allocate space in the
> file for the data and for each index. Because nodes in the list are
> at fixed locations, it is as if there are pointers directly into each
> node. Adding space between nodes is therefore O(1), as is traversing
> the list, since it can be entered at predefined points with just one
> or two seeks to the necessary location. This is cheap for SSD. Runs
> of zeroes one page or longer can be represented using only sparse
> space; no bits are stored on disk for runs of zeros >= 1 page. Runs
> of ones require either one or two pages (they are 4k pages). In
> addition, if a page goes from having no data to being full, or from
> being full to having no data, a hole can usually be punched in the
> filesystem so that the SSD can better manage space.
>
> Rather than use binning, there is a B+tree index which keeps track of
> the distinct values in the index. If there is only one value, the
> RID is stored in the B+tree so that creating the bitmap is not
> necessary; if there are >= 1 and <= 512 distinct values, a list of
> RIDs is stored on the first index page, so bitmap pages are created
> on the fly until all the slots fill and the real index is
> materialized. Do you think binning is a better approach than this?
>
> I only mention this because I would like to use WAH for compression of
> regions that are not sparse, and I plan to use data compression via
> what I think is a unique method, one which could benefit FastBit.
> Also, I haven't looked closely at FastBit in a while. Is there an easy
> way to use a FastBit WAH index as a bit vector rather than having to
> store the data into FastBit partitions?
>
> For me, each column is stored as pages. Each page has a CRC32; on
> index pages there is a mask of the page, which marks sections
> that are either empty or that contain some data. This can
> make page comparisons really fast.
>
> The compression works by taking a 1MB extent and compressing it down
> with PFOR, FPC, RLE or other forms of compression. Let's say the 1MB
> extent shrinks to 50K. At the beginning of the 1MB extent, the 50K is
> written and the rest of the space is reserved as sparse space. You
> can seek into any extent and decompress the data without having to
> search forward through an RLE-compressed structure. Since FastBit
> does not compress data (at least I don't think it does), such a
> mechanism might be useful to you. The same technique can be used for
> WAH, though I do not know if it would benefit. Reserve a chunk of
> sparse space on disk (1MB), WAH-compress 1MB of data, then write the
> smaller amount onto the disk. Then you can still seek into that 1MB
> extent and get the data, but the size on disk is significantly smaller.
>
> Anyway, if you think this is a good idea, please let me know.
>
> Regards,

_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
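On the 1MB-extent scheme quoted above: on Linux, the "write the compressed
bytes at the front, keep the rest sparse" step maps naturally onto pwrite()
plus fallocate() with FALLOC_FL_PUNCH_HOLE. A rough sketch follows; the
1 MiB extent size and the helper name are illustrative assumptions, not code
from either message:

    // Requires Linux and _GNU_SOURCE (g++ defines it by default).
    #include <fcntl.h>   // fallocate, FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE
    #include <unistd.h>  // pwrite
    #include <cstdint>

    static const off_t EXTENT = 1 << 20;   // fixed 1 MiB slot per extent

    // Write the compressed image of extent `n` at its fixed offset, then
    // deallocate the unused tail so only the compressed bytes occupy blocks.
    // Returns 0 on success, -1 on error (error handling trimmed).
    int write_extent(int fd, uint64_t n, const void *comp, size_t comp_len)
    {
        off_t base = (off_t)n * EXTENT;
        if (pwrite(fd, comp, comp_len, base) != (ssize_t)comp_len)
            return -1;
        if ((off_t)comp_len >= EXTENT)       // nothing to punch
            return 0;
        // KEEP_SIZE leaves the logical file size alone, so every extent
        // still starts at n * EXTENT and a reader can pread() it directly
        // without scanning forward through RLE/WAH words.
        return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         base + comp_len, EXTENT - comp_len);
    }

The same fallocate() call is presumably what the quoted message has in mind
for punching a hole when a page returns to all zeros.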
