As I'm sure you can imagine, we store a lot of data in protocol buffer format at Google, so we often want to store very large files with many serialized protocol buffers. The technique we use is to batch a bunch of records together, compress them, and write the compressed block to the file (with checksumming and positioning markers). Then if you want to read a specific record, you only need to decompress the one block that contains it. That doesn't take much longer than the disk seek, so it's not a problem unless you have huge blocks.
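To make the idea concrete, here is a minimal sketch of that kind of layout in Python. It is not our actual file format: the `BLK1` marker, the header layout, the use of zlib, and the in-memory index are all assumptions made up for illustration.

```python
import io
import struct
import zlib

MAGIC = b"BLK1"                  # hypothetical positioning marker
HEADER = struct.Struct("<4sII")  # marker, compressed length, CRC32 of payload


def write_blocks(records, out, records_per_block=100):
    """Write records in compressed blocks.

    Returns an index with one (block offset, slot in block) pair per record.
    """
    index = []
    for start in range(0, len(records), records_per_block):
        batch = records[start:start + records_per_block]
        # Length-prefix each record so the block can be split apart again.
        raw = b"".join(struct.pack("<I", len(r)) + r for r in batch)
        payload = zlib.compress(raw, 5)
        offset = out.tell()
        out.write(HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)))
        out.write(payload)
        index.extend((offset, slot) for slot in range(len(batch)))
    return index


def read_record(inp, index, n):
    """Read record n by seeking to its block and decompressing only that block."""
    offset, slot = index[n]
    inp.seek(offset)
    magic, size, crc = HEADER.unpack(inp.read(HEADER.size))
    payload = inp.read(size)
    if magic != MAGIC or zlib.crc32(payload) != crc:
        raise ValueError("corrupt block at offset %d" % offset)
    raw = zlib.decompress(payload)
    pos = 0
    for _ in range(slot):  # skip over the records stored before ours
        (length,) = struct.unpack_from("<I", raw, pos)
        pos += 4 + length
    (length,) = struct.unpack_from("<I", raw, pos)
    return raw[pos + 4:pos + 4 + length]


if __name__ == "__main__":
    data = [(b"record-%d " % i) * 20 for i in range(250)]
    buf = io.BytesIO()
    idx = write_blocks(data, buf)
    print(read_record(buf, idx, 137) == data[137])
```

The cost of a random read here is one seek plus one block decompression, so the block size (`records_per_block` in this sketch) is the knob that trades compression ratio against read latency.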
The code we use to do this and the file format are honestly a bit of a mess (they've grown slowly over many years), so they're not suitable to be open-sourced. It certainly makes sense to have an open source library to do this, and it sounds similar to what your code is aiming at. But I agree with Kenton that it should not be part of the Protocol Buffer library -- it should be a separate project. It doesn't even need to be directly connected to Protocol Buffers -- you can use the same format for any kind of record.

Daniel

On Fri, Dec 11, 2009 at 2:20 PM, Jacob Rief <[email protected]> wrote:

> Hello Chris,
>
> 2009/12/10 Christopher Smith <[email protected]>:
> > One compression algo that I thought would be particularly useful with
> > PB's would be LZO. It lines up nicely with PB's goals of being fast
> > and compact. Have you thought about allowing an integrated LZO stream?
> >
> > --Chris
>
> My goal is to compress huge amounts (>5GB) of small serialized chunks
> (~150...500 bytes) into a single stream, while still being able to
> randomly access each part of it without having to decompress the whole
> stream. GzipOutputStream (with level 5) reduces the size to about 40%
> compared to the uncompressed binary stream, whereas my
> LzipOutputStream (with level 5) reduces the size to about 20%. The
> difficulty with gzip is to find synchronizing boundaries in the stream
> during decompression.
>
> If your aim is to exchange small messages, say by RPC, then a fast but
> less efficient algorithm is the right choice. If however you want to
> store huge amounts of data permanently, your requirements may be
> different.
>
> In my opinion, generic streaming classes such as
> ZeroCopyIn/OutputStream should offer different compression algorithms
> for different purposes. LZO has advantages if used for communication
> of small to medium sized chunks of data. LZMA on the other hand has
> advantages if you have to store lots of data for a long term. GZIP is
> somewhere in the middle.
> Unfortunately Kenton has another opinion about adding too many
> compression streaming classes.
>
> Today I studied the API of LZO. From what I have seen, I think one
> could implement two LzoIn/OutputStream classes. LZO compression
> however has a small drawback, let me explain why: The LZO API is not
> intended to be used for streams. Instead it always compresses and
> decompresses a whole block. This is different behaviour than gzip and
> lzip, which are intended to compress streams. A compression class has
> a fixed sized buffer of typically 8 or 64kB. If this buffer is filled
> with data, lzip and gzip digest the input and you can start to fill
> the buffer from the beginning. On the other hand, the LZO compressor
> has to compress the whole buffer in one step. The next block then has
> to be concatenated with the already compressed data, which means that
> during decompression you have to fiddle these chunks apart.
>
> If your intention is to compress a chunk of data with, say less than
> 64kB each, and then to put it on the wire, then LZO is the right
> solution for you. For my requirements, as you will understand now, LZO
> does not really fit well.
>
> If there is a strong interest in an alternative Protocol Buffer
> compression stream, don't hesitate to contact me.
>
> Jacob
>
> --
> You received this message because you are subscribed to the Google Groups
> "Protocol Buffers" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to [email protected].
> For more options, visit this group at
> http://groups.google.com/group/protobuf?hl=en.
