Hi Petr [and John],

On Wed, Apr 25, 2012 at 1:18 AM, Thorgrin <[email protected]> wrote:
> Hi John,
>
> I've just stumbled upon a LZO data compression library
> (http://www.oberhumer.com/opensource/lzo/), which is used for example
> in nfdump, to cut down the space needed to store data. I wonder
> whether you ever considered adding some realtime
> compression/decompression to FastBit, since you work with a lot of
> data. I know that the indexes use a compression, but the data is not.
>
> How difficult would it be to have columns optionally compressed when
> writing to disk (maybe indicated by some file extension) and
> decompressed when needed to answer a query (which might even not be
> necessary if we look at the indexes first)?
>

I have thought of doing this. Of course, compressing the entire data column as a single stream makes no sense, as you would need to decompress everything up to the point where the data value lives. I think you would have to compress by blocks. An example (the size is just picked out of the air):

    bytes 0 to 10240 (exclusive) would be block 0
    bytes 10240 to 20480 (exclusive) would be block 1
    *
    *

Then each individual data block could be compressed, and only decompressed when the block is actually referenced for data. The mmap and/or file system offset code (used to look up data) would need to be modified to support compression, and you would also want an LRU and/or a "HOT" in-memory category cache.

An alternative (and much easier method) would be to store the data on a compressed file system and the index on a non-compressed file system. This would require two (2) directories, one for index and one for data, but IMHO this would be an easier change. My thought here is to use Solaris 10, SE11, or an illumos-based distro like OpenIndiana and use the ZFS file system (allowing compression on the "data" directory, but not on the "index" directory).

> I do not know if anyone needs such a feature, but I can see some
> benefits in it for our usecase.
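
To make the block idea concrete, here is a minimal sketch in Python (not FastBit's own C++). It assumes zlib in place of LZO, since zlib ships with Python, and it keeps the compressed blocks in a list rather than in a file. The names compress_column and BlockColumn are hypothetical and are not part of FastBit. It only illustrates the offset-to-block lookup and a small LRU cache of decompressed blocks.

```python
import zlib
from functools import lru_cache

BLOCK_SIZE = 10240  # uncompressed bytes per block; arbitrary, as in the example above


def compress_column(raw: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Split raw into fixed-size blocks and compress each one independently."""
    return [
        zlib.compress(raw[start:start + block_size])
        for start in range(0, len(raw), block_size)
    ]


class BlockColumn:
    """Random-access reads over independently compressed blocks (hypothetical)."""

    def __init__(self, blocks, length, block_size=BLOCK_SIZE, cache_blocks=64):
        self.blocks = blocks          # compressed blocks, in order
        self.length = length          # total uncompressed length in bytes
        self.block_size = block_size
        # LRU cache: recently used blocks stay decompressed in memory.
        self._load = lru_cache(maxsize=cache_blocks)(self._decompress)

    def _decompress(self, block_no: int) -> bytes:
        return zlib.decompress(self.blocks[block_no])

    def read(self, offset: int, size: int) -> bytes:
        """Return size bytes starting at an uncompressed byte offset."""
        if offset < 0 or size < 0 or offset + size > self.length:
            raise ValueError("read outside column")
        out = bytearray()
        end = offset + size
        while offset < end:
            # Map the logical offset to (block number, offset within block).
            block_no, within = divmod(offset, self.block_size)
            take = min(end - offset, self.block_size - within)
            out += self._load(block_no)[within:within + take]
            offset += take
        return bytes(out)
```

A read that straddles a block boundary, such as col.read(10230, 20), touches only blocks 0 and 1 and leaves every other block compressed. A real implementation would also need an on-disk table of compressed block offsets, which this sketch leaves out.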

FYI, I have put standard FastBit on a compressed ZFS directory (both data and index in the default hierarchy) and it is still performant. I think this is due to much less I/O, e.g. a factor of 2-3x less data read from disk, since it is compressed (and done by block) transparently to the application.

ZFS on an illumos kernel supports the following compression settings:

    on | off | lzjb | gzip | gzip-[1-9] | zle

An LZO implementation could be added, but I find that lzjb works very well. My point is that this "takes care of" block-level compression transparently and also does caching with any free memory (automatically released if the system or application needs it). Hence my comment earlier about adding a flag to FastBit to split index and data onto different file paths (and thus different ZFS file systems).

I was also thinking of digging into the ZFS code itself and making a "low level" set of hooks for data storage in FastBit. However, this assumes that I get some free time.

> Regards,
> Petr
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>

Regards,

Jon Strabala, CTO
Quantum Systems Integrators, Inc.
950 South Coast Drive, Suite 120
Costa Mesa, CA 92626
[email protected]
http://www.QuantumSI.com
phone 714 428 1133
fax 714 428 1131
mobile 714 240 3083
