Re: [FastBit-users] Data compression

Jon Strabala Sat, 28 Apr 2012 20:25:25 -0700

John,

> Having the storage layer (like ZFS) to compress separately
> might be a reasonable way to get the benefit of both worlds.


Intuitively, I think the index would need to be in one directory
(on a non-compressed file system) and the data in a different
directory (on a ZFS-compressed file system).  But as of now both data
and index get tucked away in the same directory.  I imagine I could
symlink the data files elsewhere, onto a ZFS-compressed file system
(a hard link cannot cross file systems).  This would be fine for a
test but kind of a pain in a production system.  How much work would
it be to add an option to force the index and data to be split into
different directories?
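
The symlink workaround described above can be sketched as follows. This
is only an illustration under stated assumptions: it assumes index files
can be recognised by an ".idx" suffix and that a "-part.txt" metadata
file should stay in place; both are parameters here, and the function
name split_partition is hypothetical, not part of FastBit.

```python
import os
import shutil


def split_partition(part_dir, data_dir,
                    keep_suffixes=(".idx",), keep_names=("-part.txt",)):
    """Move data files to data_dir and leave symlinks behind in part_dir.

    keep_suffixes and keep_names are assumptions about which files are
    index or metadata files; adjust them to the actual layout.
    """
    os.makedirs(data_dir, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(part_dir)):
        src = os.path.join(part_dir, name)
        # Skip subdirectories and files that were already replaced by links.
        if os.path.islink(src) or not os.path.isfile(src):
            continue
        # Leave index and metadata files on the non-compressed file system.
        if name.endswith(keep_suffixes) or name in keep_names:
            continue
        dst = os.path.join(data_dir, name)
        shutil.move(src, dst)                  # works across file systems
        os.symlink(os.path.abspath(dst), src)  # old path still resolves
        moved.append(name)
    return moved
```

Running it a second time on the same directory moves nothing, because
the data files are already symlinks.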

- Jon

On Fri, Apr 27, 2012 at 5:12 PM, K. John Wu <[email protected]> wrote:

> As Jon indicated, it will take a major overhaul in order for FastBit
> to make effective use of compressed data.  Having the storage layer
> (like ZFS) to compress separately might be a reasonable way to get the
> benefit of both worlds.
>
> In general, there is a flurry of active research on compressing
> base data in database systems; the best-known example might be
> Vertica.  There is a free version called C-Store, but development
> work on it has ended.
>
> John
>
>
> On 4/25/12 9:58 AM, Jon Strabala wrote:
> > Hi Petr [and John],
> >
> > On Wed, Apr 25, 2012 at 1:18 AM, Thorgrin <[email protected]> wrote:
> >
> >     Hi John,
> >
> >     I've just stumbled upon an LZO data compression library
> >     (http://www.oberhumer.com/opensource/lzo/), which is used, for
> >     example, in nfdump to cut down the space needed to store data.
> >     I wonder whether you ever considered adding some realtime
> >     compression/decompression to FastBit, since you work with a lot
> >     of data. I know that the indexes use compression, but the data
> >     does not.
> >
> >     How difficult would it be to have columns optionally compressed when
> >     writing to disk (maybe indicated by some file extension) and
> >     decompressed when needed to answer a query (which might even not be
> >     necessary if we look at the indexes first)?
> >
> >
> > I have thought of doing this.
> >
> > Of course, compressing the entire data column as one stream makes no
> > sense, as you would need to decompress everything up to the point
> > where the wanted data value sits.
> >
> > I think you would have to compress by blocks. An example (the size is
> > just picked out of the air):
> >
> >     bytes     0 to 10240 (exclusive) would be block 0
> >     bytes 10240 to 20480 (exclusive) would be block 1
> >     ...
> > Then each individual data block could be compressed, and only
> > decompressed on demand when the block is referenced for data.
> >
> > The mmap and/or file system offset code (to look up data) would
> > then need to be modified to support compression, plus you would
> > want an LRU cache and/or a "hot" in-memory cache category.
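
The block scheme above can be sketched roughly as follows. This is a
minimal illustration, not FastBit code: it uses zlib from the Python
standard library as a stand-in for LZO or lzjb, keeps everything in
memory instead of using mmap, and the names compress_blocks and
BlockReader are hypothetical.

```python
import zlib
from collections import OrderedDict

BLOCK_SIZE = 10240  # uncompressed bytes per block (arbitrary, as in the example)


def compress_blocks(raw, block_size=BLOCK_SIZE):
    """Split raw bytes into fixed-size blocks and compress each independently."""
    return [zlib.compress(raw[i:i + block_size])
            for i in range(0, len(raw), block_size)]


class BlockReader:
    """Serve byte ranges from compressed blocks with a small LRU block cache."""

    def __init__(self, blocks, block_size=BLOCK_SIZE, cache_blocks=4):
        self.blocks = blocks
        self.block_size = block_size
        self.cache_blocks = cache_blocks
        self.cache = OrderedDict()  # block number -> decompressed bytes
        self.decompressions = 0     # counter, only to make cache hits visible

    def _block(self, n):
        if n in self.cache:
            self.cache.move_to_end(n)  # mark as most recently used
            return self.cache[n]
        data = zlib.decompress(self.blocks[n])
        self.decompressions += 1
        self.cache[n] = data
        if len(self.cache) > self.cache_blocks:
            self.cache.popitem(last=False)  # evict the least recently used block
        return data

    def read(self, offset, length):
        """Return up to `length` bytes starting at uncompressed byte `offset`."""
        out = bytearray()
        end = offset + length
        while offset < end:
            # Map the uncompressed offset to (block number, offset in block).
            n, start = divmod(offset, self.block_size)
            if n >= len(self.blocks):
                break  # read past the end of the data
            chunk = self._block(n)[start:start + (end - offset)]
            if not chunk:
                break  # short final block
            out += chunk
            offset += len(chunk)
        return bytes(out)
```

A read that straddles a block boundary decompresses only the two blocks
it touches, and repeating the same read is served from the cache.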
> >
> > An alternative (and much easier method) would be to store the DATA
> > on a compressed file system and the index on a non-compressed
> > file system.  This would require two (2) directories, one for index and
> > one for data, but IMHO this would be an easier change.  My thought
> > here is to use Solaris 10, SE11, or an illumos based distro like
> > OpenIndiana and utilize the ZFS file system (allowing compression
> > on the "data" directory, but not allowing compression on the "index"
> > directory).
> >
> >     I do not know if anyone needs such a feature, but I can see some
> >     benefits in it for our use case.
> >
> >
> > FYI, I have put standard FastBit on a compressed ZFS directory (both
> > data and index in the default hierarchy) and it still performs well.
> > I think this is due to much less I/O, e.g. roughly 2-3x less data
> > read from disk, since it is compressed (and done by block)
> > transparently to the application.  ZFS on an illumos kernel supports
> > the following compression settings:
> >
> >     on | off | lzjb | gzip | gzip-[1-9] | zle
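
As a rough illustration of the split layout, the helper below builds
(but does not run) "zfs create" commands that set the compression
property to one of the values listed above. The dataset names are
hypothetical, and the function name zfs_create_cmd is invented for
this sketch.

```python
import re

# Values accepted for the ZFS "compression" property, as listed above.
_COMPRESSION = re.compile(r"on|off|lzjb|gzip|gzip-[1-9]|zle")


def zfs_create_cmd(dataset, compression):
    """Return the argv list for creating a dataset with a compression setting."""
    if not _COMPRESSION.fullmatch(compression):
        raise ValueError("unsupported compression value: %r" % compression)
    return ["zfs", "create", "-o", "compression=" + compression, dataset]


# Hypothetical dataset names: compressed data, uncompressed index.
commands = [
    zfs_create_cmd("tank/fastbit/data", "lzjb"),
    zfs_create_cmd("tank/fastbit/index", "off"),
]
```

The lists can be passed to subprocess.run on a host that has ZFS; they
are kept as plain data here so the sketch has no side effects.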
> >
> >
> > An LZO implementation could be added, but I find that lzjb works very
> > well. My point is that this "takes care of" block-level compression
> > transparently and also does caching with any free memory (automatically
> > released if the system or application needs it).  Hence my earlier
> > comment about adding a flag to FastBit to split up index and data on
> > different file paths (and thus ZFS file systems).
> >
> > I was also thinking of digging into the ZFS code itself and making a
> > "low level" set of hooks for data storage in FastBit.  However, this
> > assumes that I get some free time.
> >
> >
> >     Regards,
> >     Petr
> >     _______________________________________________
> >     FastBit-users mailing list
> >     [email protected] <mailto:[email protected]>
> >     https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> >
> >
> > Regards,
> >
> > Jon Strabala, CTO
> > Quantum Systems Integrators, Inc.
> > 950 South Coast Drive, Suite 120
> > Costa Mesa, CA 92626
> >
> > [email protected]
> > http://www.QuantumSI.com <http://www.quantumsi.com/>
> > phone  714 428 1133
> > fax    714 428 1131
> > mobile 714 240 3083
>

