[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271360#comment-15271360 ]
Doug Cutting commented on AVRO-1704:
------------------------------------

A few more comments:

- I think we can move all of the SpecificRecord#toBytes() and #fromBytes() code to SpecificRecordBase instead of generating it for each class. I prefer to minimize generated code. This might look like:
{code}
public class SpecificRecordBase<T extends SpecificRecordBase<T>> {
  ...
  public T fromBytes(byte[] bytes) { return (T) ...; }
}

public class Player extends SpecificRecordBase<Player> { ... }
{code}
- I suspect using DataInputStream and DataOutputStream in public APIs may be problematic for performance long-term. Maybe the only public API in the first version should be 'T fromMessage(byte[])' and 'byte[] toMessage(T)'? This can then be optimized, and, if needed, a higher-performance lower-level API can be added.
- We should implement this API for more than just specific data. This should work for generic data, Thrift, protobuf, etc., producing an identical format. So the base implementation should be passed a GenericData, which all of these representations inherit from, since it can create an appropriate DatumReader or DatumWriter. So this might look something like the following (a fuller sketch appears at the end of this message):
{code}
package org.apache.avro.data;

public class MessageCoder<T> {
  private GenericData data;
  private MessageSchemaRepo repo;

  public MessageCoder(GenericData data, MessageSchemaRepo repo) {
    this.data = data;
    this.repo = repo;
  }

  public byte[] toMessage(T object) { ... }

  public T fromMessage(byte[] bytes) { ... }
}
{code}
- Permitting alternate schema repos and alternate in-memory object representations is important, but supporting alternate message formats is not. The goal here is to standardize a message format, so I would not design things for extensibility on that axis.

> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>            Assignee: Niels Basjes
>         Attachments: AVRO-1704-2016-05-03-Unfinished.patch, AVRO-1704-20160410.patch
>
>
> I'm currently using the Datafile format for encoding messages that are written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, meaning that I can read and write data with minimal effort across the various languages in use in my organization. If there were a standardized format for encoding single values that was optimized for out-of-band schema transfer, I would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, e.g. Rabin, MD5, SHA-256, etc.
> 3. The actual schema fingerprint (according to the type).
> 4. An optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode datums in this format, as well as a MessageReader that, given a SchemaStore, would be able to decode datums. The reader would decode the fingerprint and ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed library users to inject custom backends. A simple, file system based one could be provided out of the box.
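To make the MessageCoder sketch above concrete, here is a minimal fleshing-out of toMessage/fromMessage against existing public Avro APIs (SchemaNormalization, EncoderFactory, DecoderFactory). The wire layout used here (one version byte, then an 8-byte little-endian Rabin fingerprint, then the binary-encoded datum), the extra read-schema constructor argument, and the MessageSchemaRepo interface are illustrative assumptions only, not decided parts of the format; the optional metadata map from the issue description is omitted for brevity.
{code}
package org.apache.avro.data;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;
import org.apache.avro.generic.GenericData;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class MessageCoder<T> {
  private static final byte VERSION = 1;  // hypothetical format version byte

  private final GenericData data;        // picks the in-memory representation
  private final MessageSchemaRepo repo;  // hypothetical repo interface, sketched below
  private final Schema readSchema;       // schema of the objects this coder handles

  public MessageCoder(GenericData data, MessageSchemaRepo repo, Schema readSchema) {
    this.data = data;
    this.repo = repo;
    this.readSchema = readSchema;
  }

  @SuppressWarnings("unchecked")
  public byte[] toMessage(T object) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // Header: version byte + Rabin (CRC-64-AVRO) fingerprint of the writer's schema.
    long fp = SchemaNormalization.parsingFingerprint64(readSchema);
    out.write(VERSION);
    out.write(ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(fp).array());
    // Body: the datum in plain Avro binary encoding.
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    DatumWriter<T> writer = (DatumWriter<T>) data.createDatumWriter(readSchema);
    writer.write(object, encoder);
    encoder.flush();
    return out.toByteArray();
  }

  @SuppressWarnings("unchecked")
  public T fromMessage(byte[] bytes) throws IOException {
    ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    if (buf.get() != VERSION)
      throw new IOException("Unknown message format version");
    // Resolve the writer's schema from the fingerprint, then read with resolution.
    Schema writeSchema = repo.findByFingerprint(buf.getLong());
    BinaryDecoder decoder = DecoderFactory.get()
        .binaryDecoder(bytes, buf.position(), bytes.length - buf.position(), null);
    DatumReader<T> reader = (DatumReader<T>) data.createDatumReader(writeSchema, readSchema);
    return reader.read(null, decoder);
  }
}
{code}
Under these assumptions a round trip is just 'coder.fromMessage(coder.toMessage(record))', and passing SpecificData, ThriftData, or ProtobufData instead of GenericData.get() switches the in-memory representation without changing the bytes, which is the point of parameterizing on GenericData.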
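For completeness, a sketch of the SchemaStore idea from the issue description: an abstract interface plus the "simple, file system based" backend it mentions. The interface name follows the MessageCoder sketch above; the one-file-per-fingerprint layout (hex fingerprint + ".avsc") is an assumption for illustration.
{code}
package org.apache.avro.data;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

/** Hypothetical repo interface, matching the MessageCoder constructor above. */
public interface MessageSchemaRepo {
  /** Returns the writer's schema for a Rabin fingerprint, or null if unknown. */
  Schema findByFingerprint(long fingerprint) throws IOException;
}

/** File-system backend: one .avsc file per schema, named by its fingerprint. */
class FileSystemSchemaRepo implements MessageSchemaRepo {
  private final File dir;

  FileSystemSchemaRepo(File dir) {
    this.dir = dir;
  }

  @Override
  public Schema findByFingerprint(long fingerprint) throws IOException {
    File file = new File(dir, Long.toHexString(fingerprint) + ".avsc");
    return file.isFile() ? new Schema.Parser().parse(file) : null;
  }

  /** Stores a schema under its Rabin fingerprint so readers can later resolve it. */
  long register(Schema schema) throws IOException {
    long fp = SchemaNormalization.parsingFingerprint64(schema);
    try (FileWriter w = new FileWriter(new File(dir, Long.toHexString(fp) + ".avsc"))) {
      w.write(schema.toString());
    }
    return fp;
  }
}
{code}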