I've opened https://issues.apache.org/jira/browse/AVRO-27
in order to discuss the potential of a COBS-encoded container. Jira lets me
make pretty tables to express the format. :)

-Matt

On Mon, May 4, 2009 at 10:43 AM, Doug Cutting <[email protected]> wrote:
> Matt Massie wrote:
>
>> (1) Why is a meta block not required to immediately follow the header? It
>> seems counter-intuitive to me that the requirement is for a meta block at
>> the end of the file. I would expect that we would want the reader to know
>> the metadata before any data blocks are sent (especially if you want to
>> transfer/decode a file on the fly over the wire).
>
> I wanted to support append. The idea is that the meta-block is written each
> time the file is flushed, and that the file is valid through the last
> meta-block. If a program crashes without flushing, then the file might be
> truncated to its last meta-block, and then new data could be appended.
>
> I did not have streaming in mind. The meta-block was intended to be
> retroactive. It contains the total number of entries. If the file's schema
> is a union, one might add new entries into the union as the file is
> written, as new types are added, writing the full union when the file is
> flushed.
>
> To better support streaming we might require that the metadata be written
> every so many bytes, so that a streaming application could buffer until it
> sees a metadata entry, then process the buffered items. Could that work?
> Note that automatic periodic metadata writes would also facilitate crash
> recovery, since one would know how far from the end of the file to begin
> scanning.
>
>> (2) Why are we using a 16-byte UUID block delimiter to mark block
>> boundaries instead of using, say, Consistent Overhead Byte Stuffing
>> (COBS)?
>
> In part because I didn't yet know about COBS when I implemented this
> container. (You're the second person to ask this.) Perhaps we should
> instead use COBS. The only potential disadvantage I see is that COBS
> seems to require byte-by-byte processing.
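(Aside, for readers following the thread: a minimal COBS round-trip sketch in Python, to make the "byte-by-byte processing" point concrete. This is illustrative only; the function names are mine, not from any Avro or COBS library.)

```python
def cobs_encode(data: bytes) -> bytes:
    """COBS-encode `data` so the output contains no zero bytes.

    Each chunk is prefixed with a code byte: code c in 1..254 means
    c-1 literal bytes follow and the chunk ended at a zero in the
    input; code 255 means 254 literal bytes follow with no zero.
    A single zero byte can then delimit frames unambiguously.
    """
    out = bytearray()
    block = bytearray()
    for b in data:
        if b == 0:
            out.append(len(block) + 1)  # code byte: distance to this zero
            out += block
            block.clear()
        else:
            block.append(b)
            if len(block) == 254:       # longest run without a zero
                out.append(255)
                out += block
                block.clear()
    out.append(len(block) + 1)          # final (possibly empty) block
    out += block
    return bytes(out)


def cobs_decode(enc: bytes) -> bytes:
    """Invert cobs_encode. Note the loop is inherently byte-by-byte:
    every code byte must be visited to find the next one."""
    out = bytearray()
    i = 0
    while i < len(enc):
        code = enc[i]
        out += enc[i + 1:i + code]
        i += code
        if code < 255 and i < len(enc):
            out.append(0)               # restore the elided zero
    return bytes(out)
```

The decoder cannot jump ahead: it has to chase the chain of code bytes, which is exactly the skipping cost discussed below.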
> When projecting records to a subset schema, we've seen huge speedups if
> we skip through data by chunks, passing over strings, blobs, etc. by just
> incrementing the file pointer. So I worry that COBS, while quite fast,
> might add a significant cost to such skipping. Compression also requires
> byte-by-byte processing, but provides more tangible value than COBS. So
> COBS would need to add truly negligible CPU overhead, which it might.
>
>> COBS also allows for in-place encoding and decoding with very little
>> overhead and copying (see the paper for test results).
>
> [ ... ]
>
> Thanks for all your COBS analysis! It does seem attractive. Skimming the
> paper, I see mostly space overhead benchmarks, not CPU overhead. In
> projection benchmarks, we saw a 4x speedup from just skipping strings.
> AVRO-25 proposes to make entire arrays and maps skippable, which should,
> e.g., speed the projection of a single field from large, complex records
> by 10x or more. But if compression is always used, perhaps the right
> comparison is with something like LZO. If COBS is negligible next to LZO,
> that could be good enough.
>
> Hopefully we'll develop a benchmark suite for Avro that includes
> projection & compression, so we can more easily evaluate things like
> this.
>
> Doug
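(To make the chunk-skipping concrete: Avro strings and bytes are length-prefixed, so a reader can hop over them by seeking rather than reading. A rough Python sketch; the varint here is a plain base-128 encoding for illustration, not Avro's exact zig-zag long encoding, and the function names are mine.)

```python
import io


def read_varint(stream) -> int:
    """Read a base-128 varint: low 7 bits per byte, high bit = continue.
    (Illustrative; Avro's actual long encoding also zig-zags the sign.)"""
    shift, result = 0, 0
    while True:
        b = stream.read(1)[0]
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result
        shift += 7


def skip_field(stream) -> None:
    """Skip a length-prefixed field (string/bytes) without touching its
    payload: read the length, then just advance the file pointer."""
    length = read_varint(stream)
    stream.seek(length, io.SEEK_CUR)


# Example: hop over a 5-byte string and land on the next byte.
buf = io.BytesIO(bytes([5]) + b"hello" + bytes([7]))
skip_field(buf)
assert buf.read(1) == bytes([7])
```

With COBS framing that seek would not be possible, since every byte between here and the next delimiter would have to be examined.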
