As I'm sure you can imagine, we store a lot of data in protocol buffer
format at Google, so we often want to store very large files with many
serialized protocol buffers.  The technique we use is to batch a bunch of
records together, compress them, and write the compressed block to the file
(with checksumming and positioning markers).  Then, if you want to read a
specific record, you only need to decompress one block.  That doesn't take
much longer than the disk seek, so it's not a problem unless you have huge
blocks.
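
A minimal sketch of this kind of blocked format might look like the
following.  The framing and helper names here are illustrative, not
Google's actual format, and zlib stands in for whichever codec you choose:

```python
import struct
import zlib

# Illustrative block layout:
#   [4-byte compressed length][4-byte CRC32 of compressed bytes][payload]
# where the payload, once decompressed, is a sequence of 4-byte
# length-prefixed records.

def write_block(out, records):
    """Batch records, compress them, and append one framed block to `out`."""
    payload = b"".join(struct.pack("<I", len(r)) + r for r in records)
    compressed = zlib.compress(payload, 5)
    out.write(struct.pack("<II", len(compressed), zlib.crc32(compressed)))
    out.write(compressed)

def read_block(inp):
    """Read one framed block from `inp` and return its records."""
    length, crc = struct.unpack("<II", inp.read(8))
    compressed = inp.read(length)
    assert zlib.crc32(compressed) == crc, "corrupt block"
    payload = zlib.decompress(compressed)
    records, pos = [], 0
    while pos < len(payload):
        (n,) = struct.unpack_from("<I", payload, pos)
        records.append(payload[pos + 4 : pos + 4 + n])
        pos += 4 + n
    return records
```

With an index mapping record numbers to block offsets, reading one record
means one seek plus one block decompression, as described above.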

The code we use to do this, and the file format itself, are honestly a bit
of a mess (they've grown slowly over many years), so they're not suitable
for open-sourcing.  It certainly makes sense to have an open-source library to do
this, and it sounds similar to what your code is aiming at.  But I agree
with Kenton that it should not be part of the Protocol Buffer library -- it
should be a separate project.  It doesn't even need to be directly connected
to Protocol Buffers -- you can use the same format for any kind of record.

Daniel

On Fri, Dec 11, 2009 at 2:20 PM, Jacob Rief <[email protected]> wrote:

> Hello Chris,
>
> 2009/12/10 Christopher Smith <[email protected]>:
> > One compression algo that I thought would be particularly useful with
> > PB's would be LZO.  It lines up nicely with PB's goals of being fast
> > and compact.  Have you thought about allowing an integrated LZO
> > stream?
> >
> > --Chris
>
> My goal is to compress huge amounts (>5 GB) of small serialized chunks
> (~150 to 500 bytes) into a single stream, while still being able to
> randomly access each part of it without having to decompress the whole
> stream.  GzipOutputStream (with level 5) reduces the size to about 40%
> of the uncompressed binary stream, whereas my LzipOutputStream (with
> level 5) reduces the size to about 20%.  The difficulty with gzip is
> finding synchronizing boundaries in the stream during decompression.
> If your aim is to exchange small messages, say over RPC, then a fast
> but less efficient algorithm is the right choice.  If, however, you
> want to store huge amounts of data permanently, your requirements may
> be different.
>
> In my opinion, generic streaming classes such as
> ZeroCopyIn/OutputStream should offer different compression algorithms
> for different purposes.  LZO has advantages when used to communicate
> small to medium-sized chunks of data.  LZMA, on the other hand, has
> advantages if you have to store lots of data long-term.  GZIP is
> somewhere in the middle.  Unfortunately, Kenton has a different
> opinion about adding too many compression streaming classes.
>
> Today I studied the API of LZO.  From what I have seen, I think one
> could implement two LzoIn/OutputStream classes.  LZO compression,
> however, has a small drawback; let me explain why.  The LZO API is not
> intended to be used for streams.  Instead, it always compresses and
> decompresses a whole block.  This is different behaviour from gzip and
> lzip, which are designed to compress streams.  A compression class has
> a fixed-size buffer of typically 8 or 64 kB.  Once this buffer is
> filled with data, lzip and gzip digest the input and you can start
> filling the buffer again from the beginning.  The LZO compressor, on
> the other hand, has to compress the whole buffer in one step.  The
> next block then has to be concatenated with the already compressed
> data, which means that during decompression you have to fiddle these
> chunks apart.
>
> If your intention is to compress chunks of data of, say, less than
> 64 kB each and then put them on the wire, then LZO is the right
> solution for you.  For my requirements, as you will now understand,
> LZO does not fit well.
> If there is a strong interest in an alternative Protocol Buffer
> compression stream, don't hesitate to contact me.
>
> Jacob
>
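The block-at-a-time behaviour Jacob describes can be contrasted with a
streaming codec in a few lines.  This is only a sketch: zlib's one-shot
compress stands in for an LZO block call, and the helper names are
illustrative:

```python
import zlib

# Streaming codecs (gzip/lzip style): feed pieces in, get one continuous
# stream out; no per-piece bookkeeping is needed.
def stream_compress(pieces):
    co = zlib.compressobj(5)
    out = b"".join(co.compress(p) for p in pieces)
    return out + co.flush()

# Block codecs (LZO style, zlib's one-shot call standing in): each piece
# is compressed independently, so the writer must record every compressed
# block's length to let the reader fiddle the chunks apart again.
def block_compress(pieces):
    blocks = [zlib.compress(p, 5) for p in pieces]
    return [len(b) for b in blocks], b"".join(blocks)

def block_decompress(lengths, data):
    out, pos = [], 0
    for n in lengths:
        out.append(zlib.decompress(data[pos : pos + n]))
        pos += n
    return out
```

The extra length bookkeeping in the block-oriented sketch is exactly the
overhead a streaming API spares you.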

--

You received this message because you are subscribed to the Google Groups 
"Protocol Buffers" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en.

