[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353614#comment-15353614 ]

Doug Cutting commented on AVRO-1704:
------------------------------------

That decoder interface seems particularly wide. Might these be better as base classes rather than interfaces? What power does the interface add?

The initial implementations also have hidden performance pitfalls; some operations allocate streams & arrays for every call. We might either go with a lean-and-mean API, or make sure that all of the supported invocations are efficient. I'd prefer that inefficiencies be manifest, forcing clients to allocate streams per call, rather than having folks assume they're using a ByteBuffer-optimized API. To optimize these in a thread-safe manner, I think we'd add a ThreadLocal<ByteArrayInput/OutputStream> field, right?

Do we really need the raw format support? That is already supported by the existing API; the primary goal here is to add support for a new, non-raw "message" format. Without the interface & the raw format, this could become just two utility classes, MessageEncoder and MessageDecoder. Is that too reductive?
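A minimal sketch of the ThreadLocal reuse suggested above (the class and method names here are made up for illustration, not part of the patch):

{code:java}
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Sketch only: reuse a per-thread stream so encode() doesn't allocate one per call.
class ReusingEncoder {
  private static final ThreadLocal<ByteArrayOutputStream> TEMP =
      new ThreadLocal<ByteArrayOutputStream>() {
        @Override protected ByteArrayOutputStream initialValue() {
          return new ByteArrayOutputStream();
        }
      };

  ByteBuffer encode(byte[] encodedDatum) {
    ByteArrayOutputStream out = TEMP.get();
    out.reset();  // drop any bytes left over from this thread's previous call
    out.write(encodedDatum, 0, encodedDatum.length);
    return ByteBuffer.wrap(out.toByteArray());
  }
}
{code}

And if this does reduce to two utility classes, I'd picture roughly this shape (again just a sketch; these signatures are guesses, not the actual proposal):

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;

// Sketch only: guessed signatures for a lean, non-raw "message" API.
abstract class MessageEncoder<D> {
  /** Writes the version marker, the schema fingerprint, and then the datum. */
  abstract ByteBuffer encode(D datum) throws IOException;
}

abstract class MessageDecoder<D> {
  /** Reads the fingerprint, resolves the writer's schema, and decodes the datum. */
  abstract D decode(ByteBuffer encoded) throws IOException;
}
{code}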
> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>            Assignee: Niels Basjes
>         Attachments: AVRO-1704-2016-05-03-Unfinished.patch, AVRO-1704-20160410.patch
>
> I'm currently using the Datafile format for encoding messages that are written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, meaning that I can read and write data with minimal effort across the various languages in use in my organization. If there were a standardized format for encoding single values that was optimized for out-of-band schema transfer, I would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, e.g. Rabin, MD5, SHA-256, etc.
> 3. The actual schema fingerprint (according to the type).
> 4. An optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode datums in this format, as well as a MessageReader that, given a SchemaStore, would be able to decode datums. The reader would decode the fingerprint and ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed library users to inject custom backends. A simple, file-system-based one could be provided out of the box.
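For illustration, the five pieces listed in the description might be framed like this (a sketch under assumed choices: a one-byte version, a one-byte fingerprint type identifier, a 64-bit Rabin fingerprint, and the optional metadata map omitted; none of this is a settled spec):

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Sketch only: one possible byte layout for the proposed single-message format.
class MessageFraming {
  static final int FORMAT_VERSION = 1;     // assumed value, not specified
  static final int FINGERPRINT_RABIN = 0;  // assumed type identifier, not specified

  static byte[] frame(long rabinFingerprint, byte[] encodedDatum) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(FORMAT_VERSION);             // 1. format version number
    out.write(FINGERPRINT_RABIN);          // 2. schema fingerprint type identifier
    for (int i = 0; i < 8; i++) {          // 3. the 64-bit fingerprint, little-endian
      out.write((int) (rabinFingerprint >>> (8 * i)) & 0xFF);
    }
    // 4. optional metadata map omitted in this sketch
    out.write(encodedDatum);               // 5. the encoded datum
    return out.toByteArray();
  }
}
{code}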