[
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Niels Basjes updated AVRO-1704:
-------------------------------
Status: Patch Available (was: Open)
Over the last few weeks I spent some time figuring out what I think the
format should be. I created this patch, which includes a specification for the
new format, code generators for Java, and unit tests that validate the format
in light of schema evolution and corrupt data.
I documented the new format as follows:
{quote}
Schema-tagged Binary Encoding specification
The wrapper format consists of a header and a body.
The header is always the 4 bytes of the UTF-8 encoding of the word "Avro",
followed by a single byte indicating the version of the body format.
Version 0 of the body (currently the ONLY body format that has been defined)
consists of:
# the fingerprint of the schema (see the section about Schema Fingerprints): a
64-bit long, written in the same byte order as a long would be if it were a
field in a record.
# the record serialized to bytes using the binary encoding.
{quote}
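To make the byte layout concrete, here is a minimal illustrative sketch of a
writer for this format. This is not the code from the patch: the class and
method names are mine, and I assume the standard 64-bit Rabin fingerprint from
SchemaNormalization as the fingerprint function.
{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class TaggedBinaryWriter {
    private static final byte[] HEADER = "Avro".getBytes(StandardCharsets.UTF_8);
    private static final byte BODY_VERSION = 0;

    /** Serializes a record into the wrapper format described above. */
    public static byte[] write(Schema schema, GenericRecord record) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(HEADER);        // 4 bytes: UTF-8 for "Avro"
        out.write(BODY_VERSION);  // 1 byte: body format version

        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        // The fingerprint is written as an Avro long, exactly as it would be
        // for a long field in a record (zig-zag varint encoding).
        encoder.writeLong(SchemaNormalization.parsingFingerprint64(schema));
        // The record itself, in plain Avro binary encoding.
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}
{code}
A reader would check the first five bytes, read the fingerprint back as a
long, and then decode the body with the writer's schema; that side is sketched
further down.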
Although I think this is already "pretty good", it really needs your
comments and improvement suggestions.
Thanks.
> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
> Issue Type: Improvement
> Reporter: Daniel Schierbeck
> Assignee: Niels Basjes
> Attachments: AVRO-1704-20160410.patch
>
>
> I'm currently using the Datafile format for encoding messages that are
> written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync
> markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized,
> meaning that I can read and write data with minimal effort across the various
> languages in use in my organization. If there were a standardized format for
> encoding single values that was optimized for out-of-band schema transfer, I
> would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, e.g. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type).
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode
> datums in this format, as well as a MessageReader that, given a SchemaStore,
> would be able to decode datums. The reader would decode the fingerprint and
> ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed
> library users to inject custom backends. A simple, file system based one
> could be provided out of the box.
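A minimal sketch, in Java, of the reader side described above. The
MessageReader and SchemaStore names come from the description; everything else
(the method signatures, the wire layout) is a hypothetical illustration, and
it covers only the fingerprint lookup and datum decoding (items 3 and 5 in the
list), leaving out the version number, fingerprint type identifier, and
metadata map:
{code:java}
import java.io.IOException;
import java.util.Optional;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

// Hypothetical: abstract schema lookup so library users can inject custom
// backends (file system, database, HTTP registry, ...).
interface SchemaStore {
    Optional<Schema> findByFingerprint(long fingerprint);
}

class MessageReader {
    private final SchemaStore store;
    private final Schema readerSchema;

    MessageReader(SchemaStore store, Schema readerSchema) {
        this.store = store;
        this.readerSchema = readerSchema;
    }

    GenericRecord read(byte[] message) throws IOException {
        BinaryDecoder in = DecoderFactory.get().binaryDecoder(message, null);
        // Assumed layout: the fingerprint first (as an Avro long), then the datum.
        long fingerprint = in.readLong();
        Schema writerSchema = store.findByFingerprint(fingerprint)
                .orElseThrow(() -> new IOException(
                        "No schema found for fingerprint " + fingerprint));
        // Resolve the writer's schema against the reader's, so schema
        // evolution works the same way as with data files.
        return new GenericDatumReader<GenericRecord>(writerSchema, readerSchema)
                .read(null, in);
    }
}
{code}
The out-of-the-box file system backend mentioned above could be as simple as a
SchemaStore that maps a fingerprint's hex string to an .avsc file in a
configured directory.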
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)