I've opened https://issues.apache.org/jira/browse/AVRO-27
in order to discuss the potential of a COBS-encoded container. Jira lets me
make pretty tables to express the format. :)

-Matt

On Mon, May 4, 2009 at 10:43 AM, Doug Cutting <[email protected]> wrote:
> Matt Massie wrote:
>
>> (1) Why is a meta block not required to immediately follow the header? It
>> seems counter-intuitive to me that the requirement is for a meta block at
>> the end of the file. I would expect that we would want the reader to know
>> the metadata before any data blocks are sent (especially if you want to
>> transfer/decode a file on the fly over the wire).
>
> I wanted to support append. The idea is that the meta-block is written each
> time the file is flushed, and that the file is valid through the last
> meta-block. If a program crashes without flushing, then the file might be
> truncated to its last meta-block, and then new data could be appended.
>
> I did not have streaming in mind. The meta-block was intended to be
> retroactive. It contains the total number of entries. If the file's schema
> is a union, one might add new entries into the union as the file is
> written, as new types are added, writing the full union when the file is
> flushed.
>
> To better support streaming we might require that the metadata be written
> every so many bytes, so that a streaming application could buffer until it
> sees a metadata entry, then process the buffered items. Could that work?
> Note that automatic periodic metadata writes would also facilitate crash
> recovery, since one would know how far from the end of the file to begin
> scanning.
>
>> (2) Why are we using a 16-byte UUID block delimiter to mark block
>> boundaries instead of using, say, Consistent Overhead Byte Stuffing
>> (COBS)?
>
> In part because I didn't yet know about COBS when I implemented this
> container. (You're the second person to ask this.) Perhaps we should
> instead use COBS. The only potential disadvantage I see is that COBS
> seems to require byte-by-byte processing.
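(Aside, for readers following the thread: a minimal COBS round-trip sketch in Python, to make the "byte-by-byte processing" point concrete. This is illustrative only; the function names are mine, not from any Avro or COBS library.)

```python
def cobs_encode(data: bytes) -> bytes:
    """COBS-encode `data` so the output contains no zero bytes.

    Each chunk is prefixed with a code byte: code c in 1..254 means
    c-1 literal bytes follow and the chunk ended at a zero in the
    input; code 255 means 254 literal bytes follow with no zero.
    A single zero byte can then delimit frames unambiguously.
    """
    out = bytearray()
    block = bytearray()
    for b in data:
        if b == 0:
            out.append(len(block) + 1)  # code byte: distance to this zero
            out += block
            block.clear()
        else:
            block.append(b)
            if len(block) == 254:       # longest run without a zero
                out.append(255)
                out += block
                block.clear()
    out.append(len(block) + 1)          # final (possibly empty) block
    out += block
    return bytes(out)


def cobs_decode(enc: bytes) -> bytes:
    """Invert cobs_encode. Note the loop is inherently byte-by-byte:
    every code byte must be visited to find the next one."""
    out = bytearray()
    i = 0
    while i < len(enc):
        code = enc[i]
        out += enc[i + 1:i + code]
        i += code
        if code < 255 and i < len(enc):
            out.append(0)               # restore the elided zero
    return bytes(out)
```

The decoder cannot jump ahead: it has to chase the chain of code bytes, which is exactly the skipping cost discussed below.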
> When projecting records to a subset schema, we've seen huge speedups if
> we skip through data by chunks, passing over strings, blobs, etc. by just
> incrementing the file pointer. So I worry that COBS, while quite fast,
> might add a significant cost to such skipping. Compression also requires
> byte-by-byte processing, but provides more tangible value than COBS. So
> COBS would need to add truly negligible CPU overhead, which it might.
>
>> COBS also allows for in-place encoding and decoding with very little
>> overhead and copying (see the paper for test results).
>
> [ ... ]
>
> Thanks for all your COBS analysis! It does seem attractive. Skimming the
> paper, I see mostly space overhead benchmarks, not CPU overhead. In
> projection benchmarks, we saw a 4x speedup from just skipping strings.
> AVRO-25 proposes to make entire arrays and maps skippable, which should,
> e.g., speed the projection of a single field from large, complex records
> by 10x or more. But if compression is always used, perhaps the right
> comparison is with something like LZO. If COBS is negligible next to LZO,
> that could be good enough.
>
> Hopefully we'll develop a benchmark suite for Avro that includes
> projection & compression, so we can more easily evaluate things like
> this.
>
> Doug
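(To make the chunk-skipping concrete: Avro strings and bytes are length-prefixed, so a reader can hop over them by seeking rather than reading. A rough Python sketch; the varint here is a plain base-128 encoding for illustration, not Avro's exact zig-zag long encoding, and the function names are mine.)

```python
import io


def read_varint(stream) -> int:
    """Read a base-128 varint: low 7 bits per byte, high bit = continue.
    (Illustrative; Avro's actual long encoding also zig-zags the sign.)"""
    shift, result = 0, 0
    while True:
        b = stream.read(1)[0]
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result
        shift += 7


def skip_field(stream) -> None:
    """Skip a length-prefixed field (string/bytes) without touching its
    payload: read the length, then just advance the file pointer."""
    length = read_varint(stream)
    stream.seek(length, io.SEEK_CUR)


# Example: hop over a 5-byte string and land on the next byte.
buf = io.BytesIO(bytes([5]) + b"hello" + bytes([7]))
skip_field(buf)
assert buf.read(1) == bytes([7])
```

With COBS framing that seek would not be possible, since every byte between here and the next delimiter would have to be examined.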
