[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244947#comment-15244947 ]
Niels Basjes commented on AVRO-1704:
------------------------------------

A few of the thoughts I had when creating the current patch:
# Regarding the 'Avro' header (which I still believe to be 'the way to go')
#* The cost of going to the schema registry is high on a cache miss. Problems like the one I ran into with STORM-512 will occur in other systems too and may very well overload the schema registry.
#* I consider the cost of a fixed 4-byte header to be low, but that really depends on the size of the record being transmitted (my records are in the 500-1000 byte range).
#** These extra bytes will only be persisted in streaming systems like Kafka. Long-term file formats (like Avro, Parquet and ORC) won't store them.
#** In network traffic the overhead is 'unmeasurably small' because it is unlikely that these 4 bytes will push a record over the size of a single TCP packet (1500 bytes).
# Regarding the schema fingerprint (which I consider a 'body' part).
#* The idea of the 'version' was that someone may want to use a different 'hash' instead of CRC-64-AVRO.
#* I think that in the case of encryption we should have the fingerprint encrypted too.

*In light of the encryption option and your comments I'm now considering this _brainwave_*:
* The 'header of the message' should be pluggable.
** The default is a 'fixed shape' which includes a format id (same as what my current patch does).
** I expect that making this pluggable too is possible, but that would come with some restrictions like "all records of a schema must adhere to the same base format".
* The 'body of the message' should be pluggable too.
** Format '0' is hardcoded (fingerprint+record).
** Other versions (we should define a range like 0x80-0xFF) can be used by anyone to define a custom body definition (including encryption). I expect these versions to exist only within a specific company. If they need to exchange data with others they have to share their format specification anyway.
* If we set the code up right we can have a layering system, i.e. someone can 'insert' an encryption layer and still use the 'standard' body (after decryption).
** Such an 'encryption layer' would add additional parts like an encryption type and a key id.

> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
> Issue Type: Improvement
> Reporter: Daniel Schierbeck
> Assignee: Niels Basjes
> Attachments: AVRO-1704-20160410.patch
>
>
> I'm currently using the Datafile format for encoding messages that are
> written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync
> markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized,
> meaning that I can read and write data with minimal effort across the various
> languages in use in my organization. If there was a standardized format for
> encoding single values that was optimized for out-of-band schema transfer, I
> would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type.)
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode
> datums in this format, as well as a MessageReader that, given a SchemaStore,
> would be able to decode datums. The reader would decode the fingerprint and
> ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed
> library users to inject custom backends. A simple, file system based one
> could be provided out of the box.
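
To make the layering idea in the comment and the SchemaStore lookup in the description more concrete, here is a minimal Python sketch. It is not taken from the attached patch. The byte layout is an assumption (a 3-byte marker plus a 1-byte format id to fill the 4-byte header), SHA-256 truncated to 8 bytes stands in for CRC-64-AVRO, the record is treated as opaque bytes, and every name (`MAGIC`, `InMemorySchemaStore`, `encode_message`, `wrap_message`, `decode_message`) is hypothetical.

```python
import hashlib

# Assumed layout, for illustration only: 3-byte marker + 1-byte format id.
MAGIC = b"Avr"
FORMAT_STANDARD = 0x00             # format '0': fingerprint + record
CUSTOM_RANGE = range(0x80, 0x100)  # range proposed for private body formats
FINGERPRINT_SIZE = 8


def fingerprint(schema_json: str) -> bytes:
    """Stand-in fingerprint: truncated SHA-256, NOT the real CRC-64-AVRO."""
    return hashlib.sha256(schema_json.encode("utf-8")).digest()[:FINGERPRINT_SIZE]


class InMemorySchemaStore:
    """Hypothetical SchemaStore backend that maps fingerprints to schemas."""

    def __init__(self):
        self._by_fingerprint = {}

    def register(self, schema_json: str) -> bytes:
        fp = fingerprint(schema_json)
        self._by_fingerprint[fp] = schema_json
        return fp

    def find(self, fp: bytes) -> str:
        return self._by_fingerprint[fp]


def encode_message(schema_json: str, record: bytes) -> bytes:
    """Standard body: header, then schema fingerprint, then the encoded record."""
    return MAGIC + bytes([FORMAT_STANDARD]) + fingerprint(schema_json) + record


def wrap_message(format_id: int, payload: bytes) -> bytes:
    """Put an already transformed (e.g. encrypted) message behind a custom format id."""
    if format_id not in CUSTOM_RANGE:
        raise ValueError("custom format ids must be in 0x80-0xFF")
    return MAGIC + bytes([format_id]) + payload


def decode_message(message: bytes, store, custom_layers=None):
    """Return (writer schema, record bytes), unwrapping custom layers first."""
    if len(message) < 4 or message[:3] != MAGIC:
        raise ValueError("missing or unknown header")
    format_id, body = message[3], message[4:]
    if format_id == FORMAT_STANDARD:
        if len(body) < FINGERPRINT_SIZE:
            raise ValueError("body too short for a fingerprint")
        return store.find(body[:FINGERPRINT_SIZE]), body[FINGERPRINT_SIZE:]
    if custom_layers and format_id in custom_layers:
        # The layer (e.g. decryption) yields an inner message, decoded in turn.
        return decode_message(custom_layers[format_id](body), store, custom_layers)
    raise ValueError("unsupported format id: %#x" % format_id)
```

A reader configured with `custom_layers={0x80: decrypt}` would first undo the custom layer and then fall through to the standard fingerprint lookup, which is the "insert an encryption layer and still use the standard body" behaviour described above. Because the whole inner message sits behind the layer, the fingerprint is covered by the encryption as well.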
-- This message was sent by Atlassian JIRA (v6.3.4#6332)