[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271178#comment-15271178 ]

Doug Cutting commented on AVRO-1704:
------------------------------------

A few quick comments:
- A prefix with non-printing characters has the benefit of making it clear this 
is binary data and should not be treated as text.  This may or may not matter 
here, but, for example, it is useful that there are non-printing characters at 
the start of a data file so that applications don't ever guess that these are 
text and subject to CRLF manipulation, etc.  Or, if instead, we want it to be 
printable, we should perhaps just use standard ASCII 'A' and '>'.  I don't see 
the advantage of using 'rare' printing characters, that just seems confusing to 
me.
- the changes to Schema#hashCode() may have performance implications, so we 
should at least run the Perf.java benchmarks before this is committed
- getFingerprint() needs javadoc
- invalidateHashes() is package-private, should be private
- SingleRecordSerializer is specific to SpecificRecord, so perhaps belongs in 
the specific package?
- Is this really for records only, or for any object?
- maybe the base class/interface should be called MessageEncoder instead of 
RecordSerializer, the package could be named 'message', and the storage could 
be called MessageSchemaRepo?
- the Xor example should be in a test package, not in the released library, no?
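To make the renaming suggestion concrete, here is a rough sketch of what a generic MessageEncoder interface in an 'org.apache.avro.message' package might look like. The interface name and package follow the suggestion above; the method signatures and the toy String implementation are illustrative assumptions, not the contents of the patch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch of the suggested generic interface; signatures are assumptions.
interface MessageEncoder<D> {
  // Encode a single datum into the framed single-message format.
  ByteBuffer encode(D datum) throws IOException;

  // Stream variant, useful when writing straight to a Kafka/Cassandra client.
  void encode(D datum, OutputStream stream) throws IOException;
}

// Toy implementation for plain strings, only to show the shape of the API;
// a real implementation would wrap an Avro DatumWriter.
class Utf8MessageEncoder implements MessageEncoder<String> {
  public ByteBuffer encode(String datum) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    encode(datum, out);
    return ByteBuffer.wrap(out.toByteArray());
  }

  public void encode(String datum, OutputStream stream) throws IOException {
    stream.write(datum.getBytes(StandardCharsets.UTF_8));
  }
}
```

Keeping the interface generic (rather than tied to SpecificRecord) would also answer the "records only, or any object?" question above.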

> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>            Assignee: Niels Basjes
>         Attachments: AVRO-1704-2016-05-03-Unfinished.patch, 
> AVRO-1704-20160410.patch
>
>
> I'm currently using the Datafile format for encoding messages that are 
> written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync 
> markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, 
> meaning that I can read and write data with minimal effort across the various 
> languages in use in my organization. If there was a standardized format for 
> encoding single values that was optimized for out-of-band schema transfer, I 
> would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, e.g. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type.)
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode 
> datums in this format, as well as a MessageReader that, given a SchemaStore, 
> would be able to decode datums. The reader would decode the fingerprint and 
> ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed 
> library users to inject custom backends. A simple, file system based one 
> could be provided out of the box.
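The five pieces listed in the quoted description can be sketched as a byte layout. The specific marker bytes, the one-byte fingerprint-type id, and the 8-byte Rabin fingerprint length below are placeholder assumptions for illustration; the real values would be fixed by the spec this issue proposes, and the optional metadata map is omitted here.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Minimal framing sketch for the proposed single-message format.
// Layout (all values assumed for illustration):
//   [magic][version][fingerprint type id][fingerprint][encoded datum]
class SingleMessageFraming {
  static final byte MAGIC = (byte) 0xC3;  // non-printing, marks binary data
  static final byte VERSION = 0x01;       // format version number

  static byte[] frame(byte typeId, byte[] fingerprint, byte[] datum) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(MAGIC);
    out.write(VERSION);
    out.write(typeId);           // e.g. 0 = Rabin, 1 = MD5, 2 = SHA-256
    out.writeBytes(fingerprint); // length implied by the type id
    out.writeBytes(datum);       // the Avro-binary-encoded datum
    return out.toByteArray();
  }

  // A reader would extract the fingerprint like this, then ask its
  // SchemaStore for the corresponding writer's schema before decoding.
  static byte[] fingerprintOf(byte[] framed, int fpLength) {
    return Arrays.copyOfRange(framed, 3, 3 + fpLength);
  }
}
```

Note that a non-printing magic byte like 0xC3 addresses the first comment above: no application will mistake such a message for text and apply CRLF manipulation.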



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)