[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271360#comment-15271360 ]
Doug Cutting commented on AVRO-1704:
------------------------------------

A few more comments:

- I think we can move all of the SpecificRecord#toBytes() and #fromBytes() code to SpecificRecordBase instead of generating it for each class. I prefer to minimize generated code. This might look like:
{code}
public class SpecificRecordBase<T extends SpecificRecordBase<T>> {
  ...
  public T fromBytes(byte[] bytes) { return (T) ...; }
}

public class Player extends SpecificRecordBase<Player> { ... }
{code}
- I suspect using DataInputStream and DataOutputStream in public APIs may be problematic for performance long-term. Maybe the only public API in the first version should be 'T fromMessage(byte[])' and 'byte[] toMessage(T)'? This can then be optimized, and, if needed, a higher-performance lower-level API can be added.
- We should implement this API for more than just specific data. This should work for generic data, Thrift, protobuf, etc., producing an identical format. So the base implementation should be passed a GenericData, which all of these representations inherit from, since it can create an appropriate DatumReader or DatumWriter. So this might look something like the following (a fuller sketch appears at the end of this message):
{code}
package org.apache.avro.data;

public class MessageCoder<T> {
  private GenericData data;
  private MessageSchemaRepo repo;

  public MessageCoder(GenericData data, MessageSchemaRepo repo) {
    this.data = data;
    this.repo = repo;
  }

  public byte[] toMessage(T object) { ... }

  public T fromMessage(byte[] bytes) { ... }
}
{code}
- Permitting alternate schema repos and alternate in-memory object representations is important, but supporting alternate message formats is not. The goal here is to standardize a message format, so I would not design things for extensibility on that axis.

> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>            Assignee: Niels Basjes
>         Attachments: AVRO-1704-2016-05-03-Unfinished.patch, AVRO-1704-20160410.patch
>
>
> I'm currently using the Datafile format for encoding messages that are written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, meaning that I can read and write data with minimal effort across the various languages in use in my organization. If there were a standardized format for encoding single values that was optimized for out-of-band schema transfer, I would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, e.g. Rabin, MD5, SHA-256, etc.
> 3. The actual schema fingerprint (according to the type).
> 4. An optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode datums in this format, as well as a MessageReader that, given a SchemaStore, would be able to decode datums. The reader would decode the fingerprint and ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed library users to inject custom backends. A simple, file system based one could be provided out of the box.
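To make the MessageCoder sketch above concrete, here is a minimal fleshing-out of toMessage/fromMessage against existing public Avro APIs (SchemaNormalization, EncoderFactory, DecoderFactory). The wire layout used here (one version byte, then an 8-byte little-endian Rabin fingerprint, then the binary-encoded datum), the extra read-schema constructor argument, and the MessageSchemaRepo interface are illustrative assumptions only, not decided parts of the format; the optional metadata map from the issue description is omitted for brevity.
{code}
package org.apache.avro.data;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;
import org.apache.avro.generic.GenericData;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class MessageCoder<T> {
  private static final byte VERSION = 1;  // hypothetical format version byte

  private final GenericData data;        // picks the in-memory representation
  private final MessageSchemaRepo repo;  // hypothetical repo interface, sketched below
  private final Schema readSchema;       // schema of the objects this coder handles

  public MessageCoder(GenericData data, MessageSchemaRepo repo, Schema readSchema) {
    this.data = data;
    this.repo = repo;
    this.readSchema = readSchema;
  }

  @SuppressWarnings("unchecked")
  public byte[] toMessage(T object) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // Header: version byte + Rabin (CRC-64-AVRO) fingerprint of the writer's schema.
    long fp = SchemaNormalization.parsingFingerprint64(readSchema);
    out.write(VERSION);
    out.write(ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(fp).array());
    // Body: the datum in plain Avro binary encoding.
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    DatumWriter<T> writer = (DatumWriter<T>) data.createDatumWriter(readSchema);
    writer.write(object, encoder);
    encoder.flush();
    return out.toByteArray();
  }

  @SuppressWarnings("unchecked")
  public T fromMessage(byte[] bytes) throws IOException {
    ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    if (buf.get() != VERSION)
      throw new IOException("Unknown message format version");
    // Resolve the writer's schema from the fingerprint, then read with resolution.
    Schema writeSchema = repo.findByFingerprint(buf.getLong());
    BinaryDecoder decoder = DecoderFactory.get()
        .binaryDecoder(bytes, buf.position(), bytes.length - buf.position(), null);
    DatumReader<T> reader = (DatumReader<T>) data.createDatumReader(writeSchema, readSchema);
    return reader.read(null, decoder);
  }
}
{code}
Under these assumptions a round trip is just 'coder.fromMessage(coder.toMessage(record))', and passing SpecificData, ThriftData, or ProtobufData instead of GenericData.get() switches the in-memory representation without changing the bytes, which is the point of parameterizing on GenericData.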
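For completeness, a sketch of the SchemaStore idea from the issue description: an abstract interface plus the "simple, file system based" backend it mentions. The interface name follows the MessageCoder sketch above; the one-file-per-fingerprint layout (hex fingerprint + ".avsc") is an assumption for illustration.
{code}
package org.apache.avro.data;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

/** Hypothetical repo interface, matching the MessageCoder constructor above. */
public interface MessageSchemaRepo {
  /** Returns the writer's schema for a Rabin fingerprint, or null if unknown. */
  Schema findByFingerprint(long fingerprint) throws IOException;
}

/** File-system backend: one .avsc file per schema, named by its fingerprint. */
class FileSystemSchemaRepo implements MessageSchemaRepo {
  private final File dir;

  FileSystemSchemaRepo(File dir) {
    this.dir = dir;
  }

  @Override
  public Schema findByFingerprint(long fingerprint) throws IOException {
    File file = new File(dir, Long.toHexString(fingerprint) + ".avsc");
    return file.isFile() ? new Schema.Parser().parse(file) : null;
  }

  /** Stores a schema under its Rabin fingerprint so readers can later resolve it. */
  long register(Schema schema) throws IOException {
    long fp = SchemaNormalization.parsingFingerprint64(schema);
    try (FileWriter w = new FileWriter(new File(dir, Long.toHexString(fp) + ".avsc"))) {
      w.write(schema.toString());
    }
    return fp;
  }
}
{code}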