[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244947#comment-15244947
 ] 

Niels Basjes commented on AVRO-1704:
------------------------------------

A few of the thoughts I had when creating the current patch:
# Regarding the 'Avro' header (which I still believe to be 'the way to go')
#* The cost of going to the schema registry is high on a 'cache miss'. Problems 
like the one I ran into with STORM-512 will occur in other systems too and may 
very well overload the schema registry.
#* I consider the cost of a fixed header of 4 bytes to be low. But that really 
depends on the size of the record being transmitted (my records are in the 
500-1000 bytes range).
#** These extra bytes will only be persisted in streaming systems like Kafka. 
Long-term file formats (like Avro, Parquet and ORC) won't store them.
#** In network traffic the overhead is 'unmeasurably small' because it is 
unlikely that these 4 bytes will push a record over the size of a single TCP 
packet (about 1500 bytes on Ethernet).
# Regarding the schema fingerprint (which I consider a 'body' part).
#* The idea of the 'version' was that someone may want to use a different 
'hash' instead of the CRC-64-AVRO.
#* I think that in case of encryption we should have the fingerprint encrypted 
too.

*In light of the encryption option and your comments I'm now considering this 
_brainwave_*:
* The 'header of the message' should be pluggable.
** The default is a 'fixed shape' which includes a format id. (Same as what my 
current patch does).
** I expect that making this pluggable too is possible but that would have some 
restrictions like "all records of a schema must adhere to the same base format".
* The 'body of the message' should be pluggable too. 
** Format '0' is hardcoded (fingerprint+record). 
** Yet other versions (we should define a range like 0x80-0xFF) can be used by 
anyone to define a custom body definition (including encryption). I expect 
these versions to only exist within a specific company. If they need to 
exchange data with others they should share their format specification anyway.
* If we set the code up right we can have a layering system, i.e. someone can 
'insert' an encryption layer and still use the 'standard' body (after 
decryption).
** Such an 'encryption layer' would add additional parts like an encryption 
type and a key id.
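
The layering idea above could be sketched roughly as follows in Python. Everything concrete here is an assumption made for illustration and is not what the patch does: the 3-byte MARKER, the single format-id byte, the little-endian fingerprint, the choice of 0x80 for the custom layer, and the one-byte 'encryption type' and 'key id' fields. The XOR transform is only a placeholder for a real cipher and provides no security.

```python
import struct
from typing import Dict, Tuple

MARKER = b"AVR"          # hypothetical 3-byte marker; with the format id: 4 bytes
FORMAT_STANDARD = 0x00   # format '0': fingerprint + record
FORMAT_WRAPPED = 0x80    # hypothetical id picked from the custom range 0x80-0xFF
TYPE_XOR_DEMO = 0x01     # hypothetical 'encryption type' (placeholder, not secure)


def frame(format_id: int, body: bytes) -> bytes:
    """Prepend the fixed-shape header (marker + format id) to a body."""
    if not 0 <= format_id <= 0xFF:
        raise ValueError("format id must fit in one byte")
    return MARKER + bytes([format_id]) + body


def unframe(message: bytes) -> Tuple[int, bytes]:
    """Split a message into (format id, body), checking the marker."""
    if len(message) < 4 or message[:3] != MARKER:
        raise ValueError("not a framed message")
    return message[3], message[4:]


def encode_standard(fingerprint: int, record: bytes) -> bytes:
    """Standard body: 8-byte fingerprint (byte order assumed) + record bytes."""
    return struct.pack("<Q", fingerprint) + record


def _xor(data: bytes, key: bytes) -> bytes:
    """Placeholder transform standing in for a real cipher."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def wrap(inner_message: bytes, key_id: int, key: bytes) -> bytes:
    """Insert a layer: encryption type + key id + transformed inner message."""
    body = bytes([TYPE_XOR_DEMO, key_id]) + _xor(inner_message, key)
    return frame(FORMAT_WRAPPED, body)


def decode(message: bytes, keys: Dict[int, bytes]) -> Tuple[int, bytes]:
    """Peel layers until the standard body is reached."""
    format_id, body = unframe(message)
    if format_id == FORMAT_STANDARD:
        (fingerprint,) = struct.unpack_from("<Q", body)
        return fingerprint, body[8:]
    if format_id == FORMAT_WRAPPED:
        enc_type, key_id = body[0], body[1]
        if enc_type != TYPE_XOR_DEMO:
            raise ValueError("unknown encryption type")
        return decode(_xor(body[2:], keys[key_id]), keys)
    raise ValueError("unknown format id: %#x" % format_id)
```

A reader that only knows format '0' rejects the wrapped message, while a reader with the extra layer and the right key ends up with the same (fingerprint, record) pair as for an unwrapped message.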


> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>            Assignee: Niels Basjes
>         Attachments: AVRO-1704-20160410.patch
>
>
> I'm currently using the Datafile format for encoding messages that are 
> written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync 
> markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, 
> meaning that I can read and write data with minimal effort across the various 
> languages in use in my organization. If there was a standardized format for 
> encoding single values that was optimized for out-of-band schema transfer, I 
> would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type.)
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode 
> datums in this format, as well as a MessageReader that, given a SchemaStore, 
> would be able to decode datums. The reader would decode the fingerprint and 
> ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed 
> library users to inject custom backends. A simple, file system based one 
> could be provided out of the box.
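
The MessageReader/SchemaStore split described in the quoted issue could look roughly like the Python sketch below. The class and method names (SchemaStore, find_by_fingerprint, InMemorySchemaStore, MessageReader.read) are hypothetical, the 8-byte little-endian fingerprint is an assumption, and the reader stops at returning the writer's schema together with the still-encoded datum instead of calling a real Avro decoder.

```python
import struct
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple


class SchemaStore(ABC):
    """Abstract lookup of a writer's schema by fingerprint (hypothetical API)."""

    @abstractmethod
    def find_by_fingerprint(self, fingerprint: int) -> Optional[str]:
        """Return the schema JSON for the fingerprint, or None if unknown."""


class InMemorySchemaStore(SchemaStore):
    """Simplest possible backend: a dictionary kept in memory."""

    def __init__(self) -> None:
        self._schemas: Dict[int, str] = {}

    def add(self, fingerprint: int, schema_json: str) -> None:
        self._schemas[fingerprint] = schema_json

    def find_by_fingerprint(self, fingerprint: int) -> Optional[str]:
        return self._schemas.get(fingerprint)


class MessageReader:
    """Splits a body into fingerprint and datum, then resolves the schema."""

    def __init__(self, store: SchemaStore) -> None:
        self._store = store

    def read(self, body: bytes) -> Tuple[str, bytes]:
        if len(body) < 8:
            raise ValueError("body too short to contain a fingerprint")
        # Assumption: the fingerprint is stored as 8 little-endian bytes.
        (fingerprint,) = struct.unpack_from("<Q", body)
        schema_json = self._store.find_by_fingerprint(fingerprint)
        if schema_json is None:
            raise LookupError("unknown schema fingerprint %#x" % fingerprint)
        # A real reader would now decode body[8:] using this writer's schema.
        return schema_json, body[8:]
```

A file-system or registry-backed store would implement the same single method, so the reader does not need to know where schemas come from.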



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
