Hello Avro developers. Forgive me if I'm (re)opening questions that have already been discussed.
I'd like to understand the design decisions that went into the Avro Object Container Files format.

(1) Why is a meta block not required to immediately follow the header? It seems counter-intuitive to me that the requirement is for a meta block at the end of the file. I would expect that we would want the reader to know the metadata before any data blocks are sent (especially if you want to transfer/decode a file on the fly over the wire).

(2) Why are we using a 16-byte UUID block delimiter to mark block boundaries instead of, say, Consistent Overhead Byte Stuffing (COBS)? The paper on COBS can be found at http://www.sigcomm.org/sigcomm97/papers/p062.pdf, and it includes some nice C code at the end for encoding/decoding records (it's easy to find Java code for COBS too).

A COBS-based block format might look like the following:

  delimiter  (a single byte set to 0)
  type       (a non-zero zig-zag long that expresses the record type)
  length     (a non-zero zig-zag long that expresses the length of the record)
  cobs_code  (a single byte used to help encode/decode the payload)
  ...        (the COBS-encoded data)

The nice thing about zig-zag longs is that non-zero values will never be encoded with a single zero byte (correct?). This allows us to keep a simple one-byte delimiter/record boundary (set to zero). Since the record type and length are not COBS encoded, we can quickly scan a file/socket/memory buffer for the next record boundary and easily know the payload type and length (to decide whether to process or skip the record). COBS also allows for in-place encoding and decoding with very little overhead and copying (see the paper for test results).

It makes more sense to me to use the same record boundary (0) for all Avro records instead of having the boundary be a random value per file. The format would be more resilient to data corruption and easier to parse. It's also possible (although improbable) that the 16-byte UUID might show up in the payload... especially given the size of the data Hadoop processes.

-Matt
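P.S. To make the zig-zag claim above concrete, here is a rough, untested Java sketch (not Avro's actual encoder classes, just the standard zig-zag + variable-length scheme) showing that the encoding of any non-zero long contains no zero bytes at all, so a single 0 byte is always safe to use as the record delimiter:

import java.io.ByteArrayOutputStream;

public class ZigZagDemo {

    // Zig-zag + variable-length encoding of a long, 7 bits per byte,
    // high bit set on every continuation byte (the scheme Avro uses for longs).
    static byte[] encodeLong(long n) {
        long z = (n << 1) ^ (n >> 63);              // zig-zag maps small magnitudes to small unsigned values
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));   // continuation bytes are always >= 0x80, never zero
            z >>>= 7;
        }
        out.write((int) z);                         // final byte holds the top non-zero bits, so it is
        return out.toByteArray();                   // zero only when the whole value was zero
    }

    public static void main(String[] args) {
        // Only the value 0 produces a 0x00 byte; every non-zero long encodes without one.
        for (long n : new long[] {0, 1, -1, 127, 128, -300, Long.MAX_VALUE, Long.MIN_VALUE}) {
            byte[] enc = encodeLong(n);
            boolean hasZero = false;
            for (byte b : enc) {
                if (b == 0) hasZero = true;
            }
            System.out.println(n + " -> " + enc.length + " byte(s), contains 0x00: " + hasZero);
        }
    }
}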
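Along the same lines, here is a similarly rough sketch of COBS encode/decode itself (again untested, and not the in-place variant from the paper; the delimiter, type, and length fields would be written outside of this):

import java.util.Arrays;

public class CobsSketch {

    // COBS-encode src so the output contains no zero bytes. Each "code" byte gives the
    // distance to the next zero in the original data; 0xFF means 254 non-zero bytes
    // with no zero yet. Worst-case overhead is about one byte per 254 bytes of input, plus one.
    static byte[] encode(byte[] src) {
        byte[] out = new byte[src.length + src.length / 254 + 1];
        int codeIdx = 0, o = 1, code = 1;
        for (byte b : src) {
            if (b != 0) {
                out[o++] = b;
                if (++code == 0xFF) {               // full 254-byte group: start a new one
                    out[codeIdx] = (byte) code;
                    codeIdx = o++;
                    code = 1;
                }
            } else {                                // zero byte: close the current group
                out[codeIdx] = (byte) code;
                codeIdx = o++;
                code = 1;
            }
        }
        out[codeIdx] = (byte) code;
        return Arrays.copyOf(out, o);
    }

    // Reverse the transform: copy each group and re-insert the zero it replaced.
    static byte[] decode(byte[] enc) {
        byte[] out = new byte[enc.length];          // decoded data is never longer than the input
        int o = 0, i = 0;
        while (i < enc.length) {
            int code = enc[i++] & 0xFF;
            for (int j = 1; j < code; j++) {
                out[o++] = enc[i++];
            }
            if (code != 0xFF && i < enc.length) {   // 0xFF groups and the final group carry no implied zero
                out[o++] = 0;
            }
        }
        return Arrays.copyOf(out, o);
    }

    public static void main(String[] args) {
        byte[] payload = {0x11, 0x22, 0x00, 0x33};
        byte[] enc = encode(payload);               // -> 03 11 22 02 33, no zero bytes
        System.out.println(Arrays.equals(payload, decode(enc)));
    }
}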
