Hi, Justin,

Looks like you are on to a great idea. In general, the idea of mixing RIDs
with bitmaps could be thought of as a variant of PLWAH compression
<https://www.google.com/webhp?q=PLWAH>. The use of sparse pages might not
have been done before; however, there have been a number of attempts to
build per-page indexes. Here is one example:
<http://link.springer.com/chapter/10.1007%2F978-3-642-41221-9_4>.
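For concreteness, the PLWAH trick is roughly this: when a fill word is
followed by a word containing only one (or a few) set bits, the positions of
those bits are packed into otherwise-unused bits of the fill word, so the
lone RIDs ride along inside the bitmap instead of costing a literal word
each. A minimal sketch for 32-bit words (the field widths below are my own
illustrative choices, not the exact PLWAH or FastBit layout):

    #include <cstdint>

    // One PLWAH-style fill word (illustrative layout):
    //   bit 31      -- 1 = fill word, 0 = literal word
    //   bit 30      -- fill value (run of all zeros or all ones)
    //   bits 25..29 -- 1-based position of a single dirty bit in the word
    //                  that follows the run; 0 means "no dirty bit"
    //   bits 0..24  -- run length, counted in 31-bit literal words
    static uint32_t plwah_fill(bool ones, uint32_t run_len, uint32_t dirty_pos)
    {
        return (1u << 31)
             | (uint32_t(ones) << 30)
             | ((dirty_pos & 0x1Fu) << 25)
             | (run_len & 0x01FFFFFFu);
    }

    // A run of 1000 zero words followed by a word whose only set bit is
    // bit 17 collapses into one word; the RID of that bit is recoverable
    // from run_len and dirty_pos, which is the sense in which RIDs get
    // mixed into the bitmap.
    uint32_t w = plwah_fill(false, 1000, 17 + 1);

Plain WAH would spend a fill word plus a literal word on the same pattern.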
Good luck.

John

On 6/14/15 11:53 PM, Justin Swanhart wrote:
> Hi,
>
> I'm working on a page-based bitmap index which uses sparse files to
> RLE-compress regions of the bitmap. I essentially create a linked
> list on disk in a very large sparse file, then allocate space in the
> file for the data and for each index. Because nodes in the list are
> at fixed locations, it is as if there are pointers directly into each
> node. Adding space between nodes is therefore O(1), as is traversing
> the list, since it can be entered at predefined points with just one
> or two seeks to the necessary location. This is cheap for SSD. Runs
> of zeroes one page or longer can be represented using only sparse
> space; no bits are stored on disk for runs of zeros >= 1 page. Runs
> of ones require either one or two pages (they are 4k pages). In
> addition, if a page goes from having no data to being full, or from
> being full to having no data, a hole can usually be punched in the
> filesystem so that the SSD can better manage space.
>
> Rather than use binning, there is a B+tree index which keeps track of
> the distinct values in the index. If there is only one value, the
> RID is stored in the B+tree so that creating the bitmap is not
> necessary; if there are >= 1 and <= 512 distinct values, a list of
> RIDs is stored on the first index page, so bitmap pages are created
> on the fly until all the slots fill and the real index is
> materialized. Do you think binning is a better approach than this?
>
> I only mention this because I would like to use WAH for compression of
> regions that are not sparse, and I plan to use data compression via
> what I think is a unique method, one which could benefit FastBit.
> Also, I haven't looked closely at FastBit in a while. Is there an easy
> way to use a FastBit WAH index as a bit vector rather than having to
> store the data into FastBit partitions?
>
> For me, each column is stored as pages. Each page has a CRC32; on
> index pages there is a mask of the page, which marks sections
> that are either empty or that contain some data. This can
> make page comparisons really fast.
>
> The compression works by taking a 1MB extent and compressing it down
> with PFOR, FPC, RLE or other forms of compression. Let's say the 1MB
> extent shrinks to 50K. At the beginning of the 1MB extent, the 50K is
> written and the rest of the space is reserved as sparse space. You
> can seek into any extent and decompress the data without having to
> search forward through an RLE-compressed structure. Since FastBit
> does not compress data (at least I don't think it does), such a
> mechanism might be useful to you. The same technique can be used for
> WAH, though I do not know if it would benefit. Reserve a chunk of
> sparse space on disk (1MB), WAH-compress 1MB of data, then write the
> smaller amount onto the disk. Then you can still seek into that 1MB
> extent and get the data, but the size on disk is significantly smaller.
>
> Anyway, if you think this is a good idea, please let me know.
>
> Regards,

_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
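On the 1MB-extent scheme quoted above: on Linux, the "write the compressed
bytes at the front, keep the rest sparse" step maps naturally onto pwrite()
plus fallocate() with FALLOC_FL_PUNCH_HOLE. A rough sketch follows; the
1 MiB extent size and the helper name are illustrative assumptions, not code
from either message:

    // Requires Linux and _GNU_SOURCE (g++ defines it by default).
    #include <fcntl.h>   // fallocate, FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE
    #include <unistd.h>  // pwrite
    #include <cstdint>

    static const off_t EXTENT = 1 << 20;   // fixed 1 MiB slot per extent

    // Write the compressed image of extent `n` at its fixed offset, then
    // deallocate the unused tail so only the compressed bytes occupy blocks.
    // Returns 0 on success, -1 on error (error handling trimmed).
    int write_extent(int fd, uint64_t n, const void *comp, size_t comp_len)
    {
        off_t base = (off_t)n * EXTENT;
        if (pwrite(fd, comp, comp_len, base) != (ssize_t)comp_len)
            return -1;
        if ((off_t)comp_len >= EXTENT)       // nothing to punch
            return 0;
        // KEEP_SIZE leaves the logical file size alone, so every extent
        // still starts at n * EXTENT and a reader can pread() it directly
        // without scanning forward through RLE/WAH words.
        return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         base + comp_len, EXTENT - comp_len);
    }

The same fallocate() call is presumably what the quoted message has in mind
for punching a hole when a page returns to all zeros.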
