Matt Massie wrote:
(1) Why is a meta block not required to immediately follow the header? It
seems counter-intuitive to me that the requirement is for a meta block at
the end of the file. I would expect that we would want the reader to know
the meta data before any data blocks are sent (especially if you want to
transfer/decode a file on the fly over the wire).
I wanted to support append. The idea is that the meta-block is written
each time the file is flushed, and that the file is valid through the
last meta-block. If a program crashes without flushing, then the file
might be truncated to its last meta-block, and then new data could be
appended.
I did not have streaming in mind. The meta-block was intended to be
retroactive. It contains the total number of entries. If the file's
schema is a union, one might add new branches to the union as new types
appear while the file is written, writing the full union when the file
is flushed.
To better support streaming we might require that the metadata be
written every so many bytes, so that a streaming application could
buffer until it sees a metadata entry, then process the buffered items.
Could that work? Note that automatic periodic metadata writes would
also facilitate crash recovery, since one would know how far from the
end of the file to begin scanning.
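Roughly, I'm picturing something like the sketch below (the class name,
block tags, and SYNC_INTERVAL are made up for illustration, not the
actual Avro file layout): the writer emits a metadata block on every
flush and also after every so many bytes, so a streaming reader can
buffer until the next metadata block, and recovery only has to scan a
bounded distance back from the end.

import java.io.*;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: a container that interleaves
// length-prefixed data blocks with metadata blocks.  The file is
// considered valid through its last metadata block, so appending after
// a crash means truncating back to that point first.
public class PeriodicMetaWriter implements Closeable {
  private static final long SYNC_INTERVAL = 16 * 1024; // assumed gap between forced metadata writes
  private final DataOutputStream out;
  private long bytesSinceMeta = 0;
  private long entryCount = 0;

  public PeriodicMetaWriter(OutputStream raw) {
    this.out = new DataOutputStream(new BufferedOutputStream(raw));
  }

  /** Append one serialized entry as a length-prefixed data block. */
  public void append(byte[] entry) throws IOException {
    out.writeByte('D');                 // data-block tag
    out.writeInt(entry.length);
    out.write(entry);
    entryCount++;
    bytesSinceMeta += 5 + entry.length;
    if (bytesSinceMeta >= SYNC_INTERVAL)
      writeMeta();                      // periodic metadata for streaming & recovery
  }

  /** Flush pending data; the file is valid through the metadata block written here. */
  public void flush() throws IOException {
    writeMeta();
    out.flush();
  }

  private void writeMeta() throws IOException {
    byte[] meta = ("{\"count\":" + entryCount + "}")
        .getBytes(StandardCharsets.UTF_8);
    out.writeByte('M');                 // metadata-block tag
    out.writeInt(meta.length);
    out.write(meta);
    bytesSinceMeta = 0;
  }

  @Override
  public void close() throws IOException {
    flush();
    out.close();
  }
}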
(2) Why are we using a 16-byte UUID block delimiter to mark block boundaries
instead of using say Consistent Overhead Byte Stuffing (COBS)?
In part because I didn't yet know about COBS when I implemented this
container. (You're the second person to ask this.) Perhaps we should
instead use COBS. The only potential disadvantage I see is that COBS
seems to require byte-by-byte processing. When projecting records to a
subset schema, we've seen huge speedups when we skip through data in
chunks, passing over strings, blobs, etc. by just incrementing the file
pointer. So I worry that COBS, while quite fast, might add a
significant cost to such skipping. Compression also requires
byte-by-byte processing, but provides more tangible value than COBS. So
COBS would need to add truly negligible CPU overhead, which it might.
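For reference, a straightforward COBS encoder looks something like the
sketch below (my reading of the algorithm in the paper, not code we
ship); it shows the per-byte scan for zeros that I'm weighing against
chunked skipping.

// Illustration only: plain COBS encoding.  Every input byte is
// examined so that the output contains no zero bytes; the zero byte
// can then serve as an unambiguous block delimiter.
public class Cobs {
  /** Encode src so that the result contains no zero bytes. */
  static byte[] encode(byte[] src) {
    byte[] dst = new byte[src.length + src.length / 254 + 1];
    int codeIdx = 0;   // where the current group's code byte goes
    int out = 1;       // next output position
    int code = 1;      // offset from the code byte to the next zero
    for (byte b : src) {
      if (b == 0) {
        dst[codeIdx] = (byte) code;     // close the group at the zero
        codeIdx = out++;
        code = 1;
      } else {
        dst[out++] = b;
        if (++code == 0xFF) {           // group full: 254 non-zero bytes
          dst[codeIdx] = (byte) code;
          codeIdx = out++;
          code = 1;
        }
      }
    }
    dst[codeIdx] = (byte) code;         // close the final group
    return java.util.Arrays.copyOf(dst, out);
  }
}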
COBS also allows for in-place encoding and decoding with very little
overhead and copying (see the paper for test results).
[ ... ]
Thanks for all your COBS analysis! It does seem attractive. Skimming
the paper, I see mostly space overhead benchmarks, not CPU overhead. In
projection benchmarks, we saw a 4x speedup from just skipping strings.
AVRO-25 proposes to make entire arrays and maps skippable, which should,
e.g., speed the projection of a single field from large, complex records
by 10x or more. But if compression is always used, perhaps the right
comparison is with something like LZO. If COBS is negligible next to
LZO, that could be good enough.
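For context on those skipping numbers, the sketch below is the kind of
thing I mean (hypothetical length-prefixed framing, not the actual Avro
decoder): an unwanted string costs one length read and a position bump,
independent of its size, which is what per-byte framing would forfeit.

import java.nio.ByteBuffer;

// Illustration only: skipping a length-prefixed field during projection.
public class ProjectionSkip {
  /** Skip one length-prefixed string without reading its bytes. */
  static void skipString(ByteBuffer buf) {
    int len = buf.getInt();                  // read the length prefix
    buf.position(buf.position() + len);      // advance past the payload untouched
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(64);
    byte[] s = "unwanted field".getBytes();
    buf.putInt(s.length).put(s).putInt(42);  // a [string][int] record; the string isn't projected
    buf.flip();
    skipString(buf);                         // jump over the string in one step
    System.out.println(buf.getInt());        // read the field we want: prints 42
  }
}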
Hopefully we'll develop a benchmark suite for Avro that includes
projection & compression, so we can more easily evaluate things like this.
Doug