[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14739165#comment-14739165 ]

Ryan Blue commented on AVRO-1704:
---------------------------------

I think this is a good idea. Quite a few people are doing this already, but 
with ad-hoc formats. [~granthenke] and [~gwenshap] are probably interested in 
this topic as well.

I think the format that is most widely used is simply the 8-byte schema 
fingerprint from Java (SHA256?) followed by the encoded bytes. For 
compatibility with existing data in Kafka, I'd recommend going with that unless 
we have a good reason to change it. I think it's better to specify the 
fingerprint ahead of time so we don't waste space encoding which one was used 
(or require more complicated code).
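As a rough illustration of that layout (a sketch only, not an agreed format): in 
Java you could prefix the Avro binary body with the 64-bit fingerprint from 
SchemaNormalization.parsingFingerprint64, along these lines. The class name and 
the byte order of the prefix are placeholders, not something the spec has settled.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaNormalization;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class FingerprintPrefixedWriter {

        // Encode a single record as: 8-byte schema fingerprint, then the Avro binary body.
        public static byte[] write(Schema schema, GenericRecord datum) throws IOException {
            long fingerprint = SchemaNormalization.parsingFingerprint64(schema);

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            // Byte order of the prefix is one of the details a spec would pin down;
            // big-endian is used here purely for illustration.
            out.write(ByteBuffer.allocate(8).order(ByteOrder.BIG_ENDIAN)
                    .putLong(fingerprint).array());

            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(schema).write(datum, encoder);
            encoder.flush();
            return out.toByteArray();
        }
    }

A reader would then strip the first 8 bytes, look up the writer's schema by that 
fingerprint, and decode the remainder.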

That leaves the format version number and metadata map, keeping in mind that if 
we decide we need either one then we are breaking compatibility with existing 
data and tools -- that's not too bad, but we should be aware of it. I like the 
idea of a format version number, but it might be unnecessary. I'm interested to 
hear what you envision the key/value metadata would be used for, too.

> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>
> I'm currently using the Datafile format for encoding messages that are 
> written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync 
> markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, 
> meaning that I can read and write data with minimal effort across the various 
> languages in use in my organization. If there was a standardized format for 
> encoding single values that was optimized for out-of-band schema transfer, I 
> would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type.)
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode 
> datums in this format, as well as a MessageReader that, given a SchemaStore, 
> would be able to decode datums. The reader would decode the fingerprint and 
> ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed 
> library users to inject custom backends. A simple, file system based one 
> could be provided out of the box.
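To make the reader side of the quoted proposal concrete, here is a hedged sketch of 
what the SchemaStore/MessageReader pairing described above might look like in Java, 
assuming the same 8-byte-fingerprint-plus-body layout discussed earlier in this 
comment. The interface and method names (findByFingerprint, read) are illustrative, 
not an existing Avro API.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.DecoderFactory;

    // Hypothetical interface from the proposal: maps a fingerprint back to the
    // writer's schema. Backends (file system, registry service, etc.) plug in here.
    interface SchemaStore {
        Schema findByFingerprint(long fingerprint);
    }

    class MessageReader {
        private final SchemaStore store;
        private final Schema readerSchema;

        MessageReader(SchemaStore store, Schema readerSchema) {
            this.store = store;
            this.readerSchema = readerSchema;
        }

        // Decode a message laid out as: 8-byte fingerprint, then the Avro binary body.
        GenericRecord read(byte[] message) throws IOException {
            long fingerprint = ByteBuffer.wrap(message)
                    .order(ByteOrder.BIG_ENDIAN).getLong();
            Schema writerSchema = store.findByFingerprint(fingerprint);

            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(
                    message, 8, message.length - 8, null);
            return new GenericDatumReader<GenericRecord>(writerSchema, readerSchema)
                    .read(null, decoder);
        }
    }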


