[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244347#comment-15244347 ]

Ryan Blue commented on AVRO-1704:
---------------------------------

Looks like I was a little too optimistic about time to review things this week. 
Sorry for the delay. I think we're close to a spec. Here are some additional 
thoughts.

Looks like everyone is in favor of using the CRC-64-AVRO fingerprint, which is 
good because it can be implemented in each language without pulling in a 
library dependency. It's also what's often used in practice.
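
For reference, the fingerprint is cheap to implement from scratch; the spec 
gives a table-driven routine along these lines, computed over the UTF-8 bytes 
of the schema's Parsing Canonical Form:
{code}
// CRC-64-AVRO (64-bit Rabin) fingerprint, per the Avro spec.
// Input: UTF-8 bytes of the schema's Parsing Canonical Form.
public class Crc64Avro {
  private static final long EMPTY = 0xc15d213aa4d7a795L;
  private static final long[] FP_TABLE = new long[256];

  static {
    for (int i = 0; i < 256; i++) {
      long fp = i;
      for (int j = 0; j < 8; j++)
        fp = (fp >>> 1) ^ (EMPTY & -(fp & 1L));
      FP_TABLE[i] = fp;
    }
  }

  public static long fingerprint64(byte[] buf) {
    long fp = EMPTY;
    for (byte b : buf)
      fp = (fp >>> 8) ^ FP_TABLE[(int) (fp ^ b) & 0xff];
    return fp;
  }
}
{code}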

+1 for an interface in Avro that lets you plug in a schema resolver.
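
A minimal shape for that plug-in point might be something like the sketch 
below (the name SchemaResolver and its signature are just an illustration, 
not a settled API):
{code}
import org.apache.avro.Schema;

/**
 * Hypothetical plug-in point: map a schema fingerprint back to the
 * writer's schema. Implementations could consult an in-memory cache,
 * the file system, or a remote schema registry.
 */
public interface SchemaResolver {
  /** Returns the schema for the given fingerprint, or null if unknown. */
  Schema resolve(long fingerprint);
}
{code}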

I think the fingerprint should be considered part of the header rather than the 
body. It's a small distinction, but the fingerprint is a proxy for the schema 
here, and the body/payload depends on it. The schema lives in the header of 
the container file format, so this is consistent.

I want to avoid a 4-byte sentinel value in each message. There are two uses for 
it: to make sure the message is Avro and to communicate the format version 
should we want to change it later.

Because the schema fingerprint is included in the message, it is very unlikely 
that unknown payloads will be read as Avro messages: doing so requires a 
collision with an 8-byte schema fingerprint. I think that's plenty of 
protection against passing along corrupt data. The concern that doesn't 
address is what happens when a fingerprint is unknown, which in a lot of cases 
will trigger a REST call to resolve it. I don't think adding 4 bytes to every 
encoded payload is worth it to avoid this case, when the lookup can detect 
some number of failures and stop making the RPC calls. I just don't think we 
should design the format around a solvable problem like that.

I think the second use, versioning the format, is a good idea. That only 
requires one byte, and including that byte can also serve as a way to detect 
non-Avro payloads, just with a higher probability of collision. I think that's 
a reasonable compromise. There would be something like a 1/256 chance that the 
first byte collides, assuming that byte is random in the non-Avro payload. 
That dramatically reduces the problem of making RPC calls to resolve unknown 
schema fingerprints. We want to choose the version byte carefully because 
other formats could easily have 0x00, 0x01, or an ASCII character there. I 
propose a version number with the MSB set: 0x80. That's unlikely to conflict 
with a flags byte, the first byte of a number, or the first character of a 
string.

That makes the format:
{code}
message = header body
 header = 0x80 CRC-64-AVRO(schema) (8 bytes, little endian)
   body = encoded Avro bytes using schema
{code}
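
As a sketch, writing that header in Java is only a few lines (frameMessage is 
a hypothetical helper; fingerprint64 is the routine above):
{code}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: frame an already binary-encoded Avro datum with the proposed header.
static byte[] frameMessage(long schemaFingerprint, byte[] encodedBody) {
  ByteBuffer buf = ByteBuffer.allocate(1 + 8 + encodedBody.length)
      .order(ByteOrder.LITTLE_ENDIAN);
  buf.put((byte) 0x80);           // version byte, MSB set
  buf.putLong(schemaFingerprint); // CRC-64-AVRO fingerprint, little endian
  buf.put(encodedBody);           // Avro binary encoding of the datum
  return buf.array();
}
{code}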

We could additionally have a format with a 4-byte fingerprint, version 0x81, 
if anyone is interested in it. Something simple like XORing the first 4 bytes 
of the CRC-64-AVRO fingerprint with the second 4 bytes would do. 8 bytes just 
seems like a lot when this gets scaled up to billions of records.
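
That fold is a one-liner (fingerprint32 is a hypothetical name):
{code}
// Sketch: fold the 64-bit fingerprint to 32 bits by XORing its two halves.
static int fingerprint32(long fp64) {
  return (int) (fp64 ^ (fp64 >>> 32));
}
{code}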

One last thought: in the implementation, it would be nice to allow skipping the 
version byte because a lot of people have already implemented this as 
CRC-64-AVRO + encoded bytes. That would make the Avro implementation compatible 
with existing data flows and increase the chances that we can move to this 
standard format.
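
For example, the reader could take a switch for those legacy streams (a 
sketch; readFingerprint is a hypothetical name):
{code}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: read the header, optionally tolerating the legacy layout
// (bare CRC-64-AVRO fingerprint + encoded bytes, no version byte).
static long readFingerprint(ByteBuffer buf, boolean expectVersionByte)
    throws IOException {
  if (expectVersionByte && (buf.get() & 0xff) != 0x80)
    throw new IOException("Not an Avro message (bad version byte)");
  return buf.order(ByteOrder.LITTLE_ENDIAN).getLong();
}
{code}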

> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>            Assignee: Niels Basjes
>         Attachments: AVRO-1704-20160410.patch
>
>
> I'm currently using the Datafile format for encoding messages that are 
> written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync 
> markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, 
> meaning that I can read and write data with minimal effort across the various 
> languages in use in my organization. If there was a standardized format for 
> encoding single values that was optimized for out-of-band schema transfer, I 
> would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, e.g. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type.)
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode 
> datums in this format, as well as a MessageReader that, given a SchemaStore, 
> would be able to decode datums. The reader would decode the fingerprint and 
> ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed 
> library users to inject custom backends. A simple, file system based one 
> could be provided out of the box.



