[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189402#comment-15189402
 ] 

Niels Basjes commented on AVRO-1704:
------------------------------------

I've been looking into what kind of solution would work here, since I'm working 
on a project where we need data structures to go into Kafka and be available to 
multiple consumers.

The fundamental problem we need to solve is that of "Schema Evolution" in a 
streaming environment (let's assume Kafka with its built-in persistence of 
records).
We need three things to make this happen:
# A way to recognize a 'blob' is a serialized AVRO record.
#* We could simply assume it is always an AVRO record. 
#* However, I think we should let such a record start with "AVRO" to ensure we 
can cleanly catch problems like STORM-512 (Summary: Timer ticks were written 
into Kafka, which caused a lot of deserialization errors when reading the AVRO 
records.)
# A way to determine the schema this was written with.
#* As indicated above, I vote for using CRC-64-AVRO. 
#** I noticed that a simple typo fix in the documentation of a Schema causes a 
new fingerprint to be generated. 
#** Proposal: I think we should 'clean' the schema before calculating the 
fingerprint, i.e. remove the things that do not impact the binary form of the 
record (like the doc field).
# Have a place where we can find the schemas using the fingerprint as the key.
#* Here I think (looking at AVRO-1124 and the fact that there are ready-to-run 
implementations like this [Schema 
Registry|http://docs.confluent.io/current/schema-registry/docs/index.html]) we 
should limit what we keep inside Avro to something like a "SchemaFactory" 
interface (as the storage/retrieval interface to get a Schema) and a very basic 
implementation that simply reads the available schemas from a (set of) 
property file(s). Using this, others can write additional implementations that 
can read/write to things like databases or the above-mentioned Schema Registry.
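
To make points 2 and 3 concrete, here is a minimal Python sketch under stated assumptions. The CRC-64-AVRO routine follows the fingerprint algorithm given in the Avro specification; everything else is hypothetical and only for illustration: {{strip_docs}} is a simplified stand-in for the proposed 'cleaning' step (it only drops {{doc}} attributes and would also drop a {{doc}} key inside a default value), and {{InMemorySchemaFactory}} is an in-memory placeholder for the "SchemaFactory" idea, not an existing Avro class.

```python
import json

# Seed/polynomial constant for CRC-64-AVRO, as given in the Avro specification.
EMPTY = 0xC15D213AA4D7A795


def _build_table():
    """Precompute the 256-entry lookup table used by CRC-64-AVRO."""
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            fp = (fp >> 1) ^ (EMPTY if fp & 1 else 0)
        table.append(fp)
    return table


_TABLE = _build_table()


def crc64_avro(data: bytes) -> int:
    """Return the CRC-64-AVRO fingerprint of data as an unsigned 64-bit int."""
    fp = EMPTY
    for b in data:
        fp = (fp >> 8) ^ _TABLE[(fp ^ b) & 0xFF]
    return fp


def strip_docs(node):
    """Hypothetical 'clean' step: recursively drop "doc" attributes.

    Simplification: this is not a full schema normalization; it only removes
    keys named "doc" from every JSON object in the schema.
    """
    if isinstance(node, dict):
        return {k: strip_docs(v) for k, v in node.items() if k != "doc"}
    if isinstance(node, list):
        return [strip_docs(v) for v in node]
    return node


def fingerprint(schema_json: str) -> int:
    """Fingerprint a schema after cleaning it and removing whitespace."""
    cleaned = strip_docs(json.loads(schema_json))
    text = json.dumps(cleaned, separators=(",", ":"))
    return crc64_avro(text.encode("utf-8"))


class InMemorySchemaFactory:
    """Hypothetical schema lookup keyed by fingerprint (assumption, not Avro API)."""

    def __init__(self):
        self._schemas = {}

    def register(self, schema_json: str) -> int:
        fp = fingerprint(schema_json)
        self._schemas[fp] = schema_json
        return fp

    def get_schema(self, fp: int):
        # Returns None when the fingerprint is unknown.
        return self._schemas.get(fp)
```

With this sketch, two versions of a schema that differ only in their {{doc}} text map to the same fingerprint, while a change to a field type produces a different one. A property-file or registry-backed implementation would only need to offer the same lookup by fingerprint.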

So to summarize, my proposal for the standard {{Single record 
serialization format}} can be written as:
{code}"AVRO"<CRC-64-AVRO(Normalized Schema)><regular binary form of the actual 
record>{code}
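
The framing above could be sketched in Python roughly as follows. This is only an illustration under assumptions the proposal does not settle: the fingerprint is written as 8 bytes in little-endian order, and the function names {{encode_message}} and {{decode_message}} are hypothetical. The payload is treated as an opaque byte string holding the regular Avro binary encoding of the record.

```python
import struct

MAGIC = b"AVRO"

# Header layout: 4-byte marker followed by an unsigned 64-bit fingerprint.
# Little-endian byte order is an assumption made for this sketch.
_HEADER = struct.Struct("<4sQ")


def encode_message(fingerprint: int, payload: bytes) -> bytes:
    """Prefix an already-encoded Avro record with the marker and fingerprint."""
    return _HEADER.pack(MAGIC, fingerprint) + payload


def decode_message(blob: bytes):
    """Split a framed message into (fingerprint, payload).

    Raises ValueError when the blob is too short or does not start with the
    marker, which is how stray non-Avro data would be caught early.
    """
    if len(blob) < _HEADER.size:
        raise ValueError("message shorter than the 12-byte header")
    magic, fingerprint = _HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("message does not start with the AVRO marker")
    return fingerprint, blob[_HEADER.size:]
```

A consumer would call {{decode_message}}, look up the writer's schema by the returned fingerprint, and only then hand the payload to the regular Avro decoder. Anything that is not a framed record, such as a timer tick, fails at the marker check instead of producing a confusing deserialization error.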

[~rdblue], I'm seeking feedback from you guys on this proposal. 


> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>
> I'm currently using the Datafile format for encoding messages that are 
> written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync 
> markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, 
> meaning that I can read and write data with minimal effort across the various 
> languages in use in my organization. If there was a standardized format for 
> encoding single values that was optimized for out-of-band schema transfer, I 
> would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type.)
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode 
> datums in this format, as well as a MessageReader that, given a SchemaStore, 
> would be able to decode datums. The reader would decode the fingerprint and 
> ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed 
> library users to inject custom backends. A simple, file system based one 
> could be provided out of the box.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
