[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244991#comment-15244991
 ] 

Ryan Blue commented on AVRO-1704:
---------------------------------

Sorry if what I said wasn't clear. I'm not proposing that we get rid of the 
header. I'm saying that we make it one byte instead of 4. I think what I 
outlined addresses the case where the schema cache miss is expensive and 
balances that with the per-message overhead. (I'm fine moving forward with the 
FP considered part of the body.)

A one-byte header results in at most a 1/256 chance of an expensive lookup 
(lower still if the byte value is chosen carefully). Why is that too high? And 
why 4 bytes rather than, say, 2 for a 1/65536 chance?
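To make those odds concrete, a quick back-of-envelope sketch (assuming the non-Avro bytes being tested are uniformly random, which is the worst case for a carefully chosen marker value):

```python
def false_positive_rate(marker_bytes: int) -> float:
    """Chance that random non-Avro data happens to match an n-byte marker,
    assuming each byte is uniform over 256 possible values."""
    return 1 / (256 ** marker_bytes)

assert false_positive_rate(1) == 1 / 256    # one-byte header
assert false_positive_rate(2) == 1 / 65536  # two-byte header
```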

I disagree that the impact of the extra bytes is too small to matter. It 
probably won't cause fragmentation when sending one message, but we're not 
talking about just one message: Kafka's performance depends on batching records 
together for network operations, and each message takes up space on disk. What 
matters is the percentage of data that is overhead: 4 bytes on a 500-byte 
message is 0.8%, and on a 100-byte message it is 4%.
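The arithmetic for those percentages, as a one-liner:

```python
def overhead_pct(header_bytes: int, message_bytes: int) -> float:
    """Percentage of each message consumed by a fixed-size header."""
    return 100.0 * header_bytes / message_bytes

assert overhead_pct(4, 500) == 0.8  # 4-byte header, 500-byte messages
assert overhead_pct(4, 100) == 4.0  # 4-byte header, 100-byte messages
```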

In terms of how much older data I can keep in a Kafka topic, that overhead 
accounts for 11m 30s to 57m 30s per day. If I provision for a 3-day window of 
data in Kafka, I'm losing between half an hour and 3 hours of that window just 
to store 'Avr0' over and over. That's why I think we have to strike a balance 
between the two concerns. 1 or 2 bytes should really be sufficient, depending 
on the false-positive probability we want. And false positives are only that 
costly if each one causes an RPC, which we can avoid with a little 
failure-detection logic.
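That failure-detection logic can be as simple as remembering fingerprints that have already failed to resolve, so a stream of false positives costs at most one RPC each rather than one RPC per message. A minimal sketch (the store object and its lookup method are hypothetical stand-ins for the SchemaStore backend from the issue description):

```python
class CachingResolver:
    """Resolve schema fingerprints without repeating expensive lookups.

    Fingerprints the store has already failed to resolve are remembered,
    so a corrupt or non-Avro message triggers at most one remote call.
    """

    def __init__(self, store):
        self.store = store    # hypothetical backend with lookup(fingerprint)
        self.known = {}       # fingerprint -> schema
        self.missing = set()  # fingerprints that already failed once

    def resolve(self, fingerprint):
        if fingerprint in self.known:
            return self.known[fingerprint]
        if fingerprint in self.missing:
            return None  # known-bad: skip the expensive RPC
        schema = self.store.lookup(fingerprint)  # the (expensive) remote call
        if schema is None:
            self.missing.add(fingerprint)
        else:
            self.known[fingerprint] = schema
        return schema
```

A real implementation would bound or expire the negative cache, but the point stands: a false positive need not mean a round trip every time.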

> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>            Assignee: Niels Basjes
>         Attachments: AVRO-1704-20160410.patch
>
>
> I'm currently using the Datafile format for encoding messages that are 
> written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync 
> markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, 
> meaning that I can read and write data with minimal effort across the various 
> languages in use in my organization. If there were a standardized format for 
> encoding single values that was optimized for out-of-band schema transfer, I 
> would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type.)
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode 
> datums in this format, as well as a MessageReader that, given a SchemaStore, 
> would be able to decode datums. The reader would decode the fingerprint and 
> ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed 
> library users to inject custom backends. A simple, file system based one 
> could be provided out of the box.
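The pieces enumerated in the description above might be laid out on the wire as follows; a minimal Python sketch of the one-byte-header variant discussed in this comment (the marker value and fingerprint length are placeholder assumptions, not standardized values, and the datum is assumed to be already Avro binary-encoded):

```python
MAGIC = b"\xc3"  # hypothetical one-byte format marker

def write_message(fingerprint: bytes, encoded_datum: bytes) -> bytes:
    """Frame a single datum as: marker | schema fingerprint | encoded datum."""
    return MAGIC + fingerprint + encoded_datum

def read_message(buf: bytes, fingerprint_len: int = 8):
    """Split a framed message back into (fingerprint, encoded datum).

    The caller would pass the fingerprint to its SchemaStore to recover
    the writer's schema before decoding the datum.
    """
    if buf[:1] != MAGIC:
        raise ValueError("not a single-message Avro payload")
    fingerprint = buf[1:1 + fingerprint_len]
    return fingerprint, buf[1 + fingerprint_len:]
```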



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
