Hi Petr [and John],

On Wed, Apr 25, 2012 at 1:18 AM, Thorgrin <[email protected]> wrote:

> Hi John,
>
> I've just stumbled upon a LZO data compression library
> (http://www.oberhumer.com/opensource/lzo/), which is used for example
> in nfdump, to cut down the space needed to store data. I wonder
> whether you ever considered adding some realtime
> compression/decompression to FastBit, since you work with a lot of
> data. I know that the indexes use compression, but the data does not.
>
> How difficult would it be to have columns optionally compressed when
> writing to disk (maybe indicated by some file extension) and
> decompressed when needed to answer a query (which might even not be
> necessary if we look at the indexes first)?
>
>
I have thought of doing this.

Of course, compressing the entire data column as one stream makes no
sense, as you would need to decompress everything up to the point where
the requested data value sits.

I think you would have to compress by blocks. For example (the block
size is just picked out of the air):

bytes     0 to 10240 (exclusive) would be block 0
bytes 10240 to 20480 (exclusive) would be block 1
  ...

Then each individual data block could be compressed, and decompressed
only when the block is actually referenced for data.

And then the mmap and/or file system offset code (used to look up data)
would need to be modified to support compression, plus you would
want an LRU cache and/or a "hot" in-memory cache category.
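
To make the block idea concrete, here is a minimal sketch in Python (not
FastBit's C++, and not FastBit code). Everything in it is an assumption for
illustration: the names compress_blocks and BlockCompressedColumn are
hypothetical, zlib stands in for LZO, and the blocks live in memory rather
than in a file with an offset table.

```python
import zlib
from functools import lru_cache

BLOCK_SIZE = 10240  # uncompressed bytes per block, as in the example above


def compress_blocks(raw, block_size=BLOCK_SIZE):
    """Split raw column bytes into fixed-size blocks and compress each one."""
    return [
        zlib.compress(raw[start:start + block_size])
        for start in range(0, len(raw), block_size)
    ]


class BlockCompressedColumn:
    """Read-only view that decompresses only the blocks a read touches."""

    def __init__(self, blocks, block_size=BLOCK_SIZE, cache_blocks=64):
        self._blocks = blocks
        self._block_size = block_size
        # LRU cache of decompressed blocks, keyed by block number.
        self._load = lru_cache(maxsize=cache_blocks)(self._decompress)

    def _decompress(self, block_no):
        return zlib.decompress(self._blocks[block_no])

    def read(self, offset, length):
        """Return up to `length` bytes starting at uncompressed `offset`."""
        out = bytearray()
        end = offset + length
        while offset < end:
            # Map the uncompressed offset to (block number, offset in block).
            block_no, within = divmod(offset, self._block_size)
            if block_no >= len(self._blocks):
                break  # read runs past the end of the column
            chunk = self._load(block_no)[within:within + (end - offset)]
            if not chunk:
                break
            out += chunk
            offset += len(chunk)
        return bytes(out)

    def cache_info(self):
        """Expose hit/miss counters of the decompressed-block cache."""
        return self._load.cache_info()
```

A read that straddles the boundary at byte 10240 decompresses blocks 0 and
1 only; repeating it is served from the cache. A real version would also
need an on-disk table of compressed block offsets, since compressed blocks
are no longer a fixed size.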

An alternative (and much easier method) would be to store the DATA
on a compressed file system and the index on a non-compressed
file system.  This would require two (2) directories, one for index and
one for data, but IMHO this would be an easier change.  My thought
here is to use Solaris 10, SE11, or an illumos based distro like
OpenIndiana and utilize the ZFS file system (allowing compression
on the "data" directory, but not allowing compression on the "index"
directory).

> I do not know if anyone needs such a feature, but I can see some
> benefits in it for our use case.


FYI, I have put standard FastBit on a compressed ZFS directory (both data
and index in the default hierarchy) and it is still performant. I think
this is due to much less I/O, e.g. a factor of 2-3x less when reading
data, since the data is compressed (by block) transparently to the
application.  ZFS on an illumos kernel supports the following values for
its compression setting:

on | off | lzjb | gzip | gzip-[1-9] | zle
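
As a rough sketch of how those values would be applied, the Python helper
below only builds the argument list for "zfs set compression=<value>
<dataset>"; it does not run anything. The helper name and the dataset names
(tank/fastbit/data, tank/fastbit/index) are hypothetical, and the accepted
values are just the ones listed above.

```python
import re

# Values listed above for the illumos ZFS "compression" property.
_COMPRESSION = re.compile(r"^(on|off|lzjb|gzip|gzip-[1-9]|zle)$")


def zfs_set_compression(dataset, value):
    """Build (not run) the argv for `zfs set compression=<value> <dataset>`."""
    if not _COMPRESSION.match(value):
        raise ValueError(f"unsupported compression value: {value!r}")
    return ["zfs", "set", f"compression={value}", dataset]


# Hypothetical dataset names: compress the data, leave the index alone.
commands = [
    zfs_set_compression("tank/fastbit/data", "lzjb"),
    zfs_set_compression("tank/fastbit/index", "off"),
]
for argv in commands:
    print(" ".join(argv))
```

The resulting lists could be handed to subprocess.run on a host that has
the zfs utility and the needed privileges.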


An LZO implementation could be added, but I find that lzjb works very
well.  My point is that this "takes care of" block-level compression
transparently and also does caching with any free memory (automatically
released if the system or an application needs it).  Hence my earlier
comment about adding a flag to FastBit to split index and data across
different file paths (and thus ZFS file systems).
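
What such a split could look like, as a small Python sketch: this is not an
existing FastBit option, and the function name, its arguments, and the
directory layout are hypothetical. It assumes one raw file per column with
the index next to it under the same name plus ".idx".

```python
from pathlib import Path


def column_paths(partition, column, data_root, index_root=None):
    """Return (data_file, index_file) for one column of one partition.

    With index_root unset, both files share one tree, as in the default
    hierarchy. With index_root set, the index goes to a separate tree,
    which could sit on a file system without compression.
    """
    data_root = Path(data_root)
    index_root = Path(index_root) if index_root is not None else data_root
    data_file = data_root / partition / column
    index_file = index_root / partition / f"{column}.idx"
    return data_file, index_file


# Hypothetical mount points for a compressed and an uncompressed file system.
data_file, index_file = column_paths("part0", "srcip", "/fb/data", "/fb/index")
print(data_file)
print(index_file)
```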

I was also thinking of digging into the ZFS code itself and making a
"low-level" set of hooks for data storage in FastBit.  However, this
assumes that I get some free time.


> Regards,
> Petr
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>

Regards,

Jon Strabala, CTO
Quantum Systems Integrators, Inc.
950 South Coast Drive, Suite 120
Costa Mesa, CA 92626

[email protected]
http://www.QuantumSI.com <http://www.quantumsi.com/>
phone  714 428 1133
fax    714 428 1131
mobile 714 240 3083
