[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189402#comment-15189402 ]
Niels Basjes commented on AVRO-1704:
------------------------------------

I've been looking into what kind of solution would work here, since I'm working on a project where we need data structures to go into Kafka and be available to multiple consumers. The fundamental problem we need to solve is that of "Schema Evolution" in a streaming environment (let's assume Kafka with its built-in persistence of records). We need three things to make this happen:
# A way to recognize that a 'blob' is a serialized Avro record.
#* We could simply assume it is always an Avro record.
#* I think we should instead let such a record start with the literal bytes "AVRO", so that we can cleanly catch problems like STORM-512 (summary: timer ticks were written into Kafka, which caused a lot of deserialization errors when reading the Avro records).
# A way to determine the schema the record was written with.
#* As indicated above, I vote for using the CRC-64-AVRO fingerprint.
#** I noticed that a simple typo fix in the documentation of a schema causes a new fingerprint to be generated.
#** Proposal: 'clean' the schema before calculating the fingerprint, i.e. remove the things that do not impact the binary form of the record (like the doc field).
# A place where we can find the schemas, using the fingerprint as the key.
#* Here I think (looking at AVRO-1124 and the fact that there are ready-to-run implementations like this [Schema Registry|http://docs.confluent.io/current/schema-registry/docs/index.html]) we should limit what we keep inside Avro to something like a "SchemaFactory" interface (as the storage/retrieval interface to get a Schema) plus a very basic implementation that simply reads the available schemas from a (set of) property file(s). Using this, others can write additional implementations that read/write to things like databases or the above-mentioned Schema Registry.

So, to summarize, my proposal for the standard {{single record serialization format}} can be written as:
{code}
"AVRO"<CRC-64-AVRO(Normalized Schema)><regular binary form of the actual record>
{code}
[~rdblue], I'm seeking feedback from you guys on this proposal.
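To make the proposed framing concrete, here is a minimal writer sketch in Java. This is only an illustration, not an existing Avro API: the class name is hypothetical, the byte order of the fingerprint is an arbitrary choice here, and I'm assuming {{SchemaNormalization.parsingFingerprint64()}} (which already ignores doc fields in the Parsing Canonical Form) as the fingerprint of the 'cleaned' schema.

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

/** Hypothetical helper illustrating the proposed framing; not part of Avro. */
public class SingleRecordSerializer {

  private static final byte[] MAGIC = "AVRO".getBytes(StandardCharsets.US_ASCII);

  public static byte[] serialize(GenericRecord record) throws IOException {
    Schema schema = record.getSchema();

    // CRC-64-AVRO fingerprint of the Parsing Canonical Form, which already
    // drops attributes (such as "doc") that do not affect the binary encoding.
    long fingerprint = SchemaNormalization.parsingFingerprint64(schema);

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(MAGIC);                                                // "AVRO" marker
    out.write(ByteBuffer.allocate(8).putLong(fingerprint).array());  // schema fingerprint
                                                                     // (byte order would need to be specified)

    // Regular Avro binary form of the record.
    BinaryEncoder encoder = EncoderFactory.get().directBinaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
    encoder.flush();

    return out.toByteArray();
  }
}
{code}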
> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>
> I'm currently using the Datafile format for encoding messages that are
> written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync
> markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized,
> meaning that I can read and write data with minimal effort across the various
> languages in use in my organization. If there was a standardized format for
> encoding single values that was optimized for out-of-band schema transfer, I
> would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type.)
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode
> datums in this format, as well as a MessageReader that, given a SchemaStore,
> would be able to decode datums. The reader would decode the fingerprint and
> ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed
> library users to inject custom backends. A simple, file system based one
> could be provided out of the box.
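To illustrate the {{SchemaStore}} / {{MessageReader}} idea from the quoted description (and the "SchemaFactory" interface proposed above), here is a minimal sketch in Java. The names, method signatures, and the fixed "AVRO" + 8-byte fingerprint framing are assumptions for illustration only, not an existing Avro API.

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

/** Pluggable backend mapping a schema fingerprint to the writer's schema (hypothetical). */
interface SchemaStore {
  /** Return the schema registered under the given fingerprint, or null if unknown. */
  Schema findByFingerprint(long fingerprint);
}

/** Hypothetical reader counterpart to the writer sketch above. */
class SingleRecordDeserializer {
  private static final byte[] MAGIC = "AVRO".getBytes(StandardCharsets.US_ASCII);

  private final Schema readerSchema;
  private final SchemaStore store;

  SingleRecordDeserializer(Schema readerSchema, SchemaStore store) {
    this.readerSchema = readerSchema;
    this.store = store;
  }

  GenericRecord deserialize(byte[] message) throws IOException {
    ByteBuffer buf = ByteBuffer.wrap(message);

    // 1. Recognize the marker so non-Avro blobs fail fast.
    byte[] marker = new byte[MAGIC.length];
    buf.get(marker);
    if (!Arrays.equals(marker, MAGIC)) {
      throw new IOException("Not a framed Avro record");
    }

    // 2. Look up the writer's schema by its fingerprint.
    long fingerprint = buf.getLong();
    Schema writerSchema = store.findByFingerprint(fingerprint);
    if (writerSchema == null) {
      throw new IOException("Unknown schema fingerprint: " + fingerprint);
    }

    // 3. Decode the payload with normal Avro schema resolution (writer -> reader).
    GenericDatumReader<GenericRecord> datumReader =
        new GenericDatumReader<>(writerSchema, readerSchema);
    return datumReader.read(null, DecoderFactory.get().binaryDecoder(
        message, buf.position(), message.length - buf.position(), null));
  }
}
{code}

A basic file-based implementation of such a store (reading schemas from a set of property/avsc files) could ship with Avro, while the Confluent Schema Registry or a database could be plugged in behind the same interface.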