[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247442#comment-15247442
 ] 

Niels Basjes commented on AVRO-1704:
------------------------------------

I agree with what you are saying. So the header should be shorter, but not too 
short.
I think that having only 1 byte is too short; 2 bytes should be fine: 1 marker 
byte and 1 body version byte.

So the updated proposal becomes:
* Header becomes 2 bytes in total. 'Ã' '<body version byte>'
** I chose the 'Ã' (0xC3) because 
*** It is a 'human readable character'. 
*** It looks like an 'A' (from Avro) under a 'Wave', and since the primary use 
case is currently streaming this seems like the right marker. 
*** It is also a very uncommon character, so if we see it the collision 
probability drops dramatically.
** The '<body version byte>' can be any byte that essentially defines the 
record structure that follows. This can be used to indicate for example the 
difference between a normal record and an encrypted record.
*** I think that we should also pick an 'uncommon' byte for this one to mark 
the default record version. I think this one is a good candidate: '»' (0xBB) 
because it looks like a symbol for 'fast'.
* The default body (i.e. version 0xBB) becomes 
** body: fingerprint record
*** fingerprint = CRC-64-AVRO(normalized schema) (8 bytes, little endian)
*** record = encoded Avro bytes using schema

So the overall record using the default body structure would look like this:
{code}
message = header body
 header = 'Ã»' (== 0xC3 0xBB)
   body = <CRC-64-AVRO(Normalized(schema)) (8 bytes, little endian)> <encoded Avro bytes using schema>
{code}
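
The layout above could be framed and unframed along these lines. This is only a sketch: `encode_message` and `decode_message` are hypothetical names, the fingerprint is assumed to be supplied as an unsigned 64-bit integer, and the payload is assumed to be already-encoded Avro bytes.

```python
import struct

MARKER = 0xC3           # 'Ã' in ISO-8859-1, the proposed marker byte
DEFAULT_VERSION = 0xBB  # '»' in ISO-8859-1, the proposed default body version
HEADER = bytes([MARKER, DEFAULT_VERSION])
PREFIX_LEN = len(HEADER) + 8  # 2 header bytes + 8 fingerprint bytes


def encode_message(fingerprint: int, payload: bytes) -> bytes:
    """Prepend the 2-byte header and the little-endian fingerprint."""
    return HEADER + struct.pack("<Q", fingerprint) + payload


def decode_message(message: bytes):
    """Split a message into (fingerprint, payload); reject unknown headers."""
    if len(message) < PREFIX_LEN:
        raise ValueError("message shorter than header plus fingerprint")
    if message[0] != MARKER:
        raise ValueError("missing marker byte 0xC3")
    if message[1] != DEFAULT_VERSION:
        raise ValueError("unsupported body version 0x%02X" % message[1])
    (fingerprint,) = struct.unpack_from("<Q", message, len(HEADER))
    return fingerprint, message[PREFIX_LEN:]
```

A reader would pass the returned fingerprint to its schema store and then decode the payload with the schema it gets back. Other body versions (for example an encrypted record) would branch on `message[1]` instead of raising.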

In the generated code I'll see what can be done to make both the header and 
body code 'pluggable'.
I think that the Schema Storage should get a capped 'cache' (LRU?) that retains 
the fingerprints that are 'known to not exist'.
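
One possible shape for such a capped cache is sketched below, assuming LRU eviction. The class and method names are hypothetical and not part of any existing Avro API.

```python
from collections import OrderedDict


class NegativeFingerprintCache:
    """Capped LRU set of fingerprints that the schema storage could not resolve."""

    def __init__(self, capacity: int = 1024):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._missing = OrderedDict()  # fingerprint -> True, oldest first

    def mark_missing(self, fingerprint: int) -> None:
        """Record a failed lookup, evicting the least recently used entry if full."""
        self._missing[fingerprint] = True
        self._missing.move_to_end(fingerprint)
        while len(self._missing) > self._capacity:
            self._missing.popitem(last=False)

    def is_known_missing(self, fingerprint: int) -> bool:
        """Return True if a lookup already failed; refreshes the entry's recency."""
        if fingerprint in self._missing:
            self._missing.move_to_end(fingerprint)
            return True
        return False

    def forget(self, fingerprint: int) -> None:
        """Drop an entry, e.g. once the schema has been registered."""
        self._missing.pop(fingerprint, None)
```

The storage would consult `is_known_missing` before going to its backend and call `mark_missing` when a lookup fails. The cap keeps a stream of random, invalid fingerprints from growing the cache without bound.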



> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>            Assignee: Niels Basjes
>         Attachments: AVRO-1704-20160410.patch
>
>
> I'm currently using the Datafile format for encoding messages that are 
> written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync 
> markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, 
> meaning that I can read and write data with minimal effort across the various 
> languages in use in my organization. If there was a standardized format for 
> encoding single values that was optimized for out-of-band schema transfer, I 
> would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type.)
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode 
> datums in this format, as well as a MessageReader that, given a SchemaStore, 
> would be able to decode datums. The reader would decode the fingerprint and 
> ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed 
> library users to inject custom backends. A simple, file system based one 
> could be provided out of the box.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
