Ryan Blue commented on AVRO-1704:
---------------------------------

Looks like I was a little too optimistic about the time to review things this week. Sorry for the delay. I think we're close to a spec; here are some additional thoughts.

Looks like everyone is for using the CRC-64-AVRO fingerprint, which is good because it can be implemented in each language and doesn't require a library dependency. That's also what's often used in practice.

+1 for an interface in Avro that lets you plug in a schema resolver.

I think the fingerprint should be considered part of the header rather than the body. It's a small distinction, but the fingerprint is a proxy for the schema here and the body/payload depends on it. The schema is in the container file header, so this is consistent.

I want to avoid a 4-byte sentinel value in each message. There are two uses for it: to make sure the message is Avro, and to communicate the format version should we want to change it later. Because the schema fingerprint is included in the message, it is very unlikely that unknown payloads will be read as Avro messages, since that would require a collision with an 8-byte schema fingerprint. I think that's plenty of protection from passing along corrupt data. The concern that doesn't address is what happens when a fingerprint is unknown, which in a lot of cases will cause a REST call to resolve it. I don't think adding 4 bytes to every encoded payload is worth avoiding this case when the lookup can detect some number of failures and stop making the RPC calls. I just don't think we should design around a solvable problem in the format like that.

I think the second use, versioning the format, is a good idea. That only requires one byte, and including that byte can also serve as a way to detect non-Avro payloads, just with a higher probability of collision. I think that's a reasonable compromise. There would be something like a 1/256 chance that the first byte collides, assuming that byte is random in the non-Avro payload. That dramatically reduces the problem of making RPC calls to resolve unknown schema fingerprints.

We want to choose the version byte carefully because other formats could easily have 0x00, 0x01, or an ASCII character there. I propose a version number with the MSB set: 0x80. That's unlikely to conflict with a flags byte, the first byte of a number, or the first character of a string. That makes the format:

{code}
message = header body
header  = 0x80 CRC-64-AVRO(schema) (8 bytes, little endian)
body    = encoded Avro bytes using schema
{code}

We could additionally have a format with a 4-byte fingerprint, version 0x81, if anyone is interested in it. Something simple like XORing the first 4 bytes of the CRC-64-AVRO fingerprint with the second 4 bytes would work. 8 bytes just seems like a lot when this gets scaled up to billions of records.

One last thought: in the implementation, it would be nice to allow skipping the version byte, because a lot of people have already implemented this as CRC-64-AVRO + encoded bytes. That would make the Avro implementation compatible with existing data flows and increase the chances that we can move to this standard format.
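For reference, CRC-64-AVRO really is small enough that each language binding can carry its own copy: the Avro spec gives a table-driven implementation of a few lines, applied to the UTF-8 bytes of the schema's Parsing Canonical Form. A sketch of it in Java:

{code}
public class Crc64Avro {
  static final long EMPTY = 0xc15d213aa4d7a795L;
  static long[] FP_TABLE = null;

  // Fingerprint the UTF-8 bytes of the schema's Parsing Canonical Form.
  static long fingerprint64(byte[] buf) {
    if (FP_TABLE == null) initFPTable();
    long fp = EMPTY;
    for (int i = 0; i < buf.length; i++)
      fp = (fp >>> 8) ^ FP_TABLE[(int) (fp ^ buf[i]) & 0xff];
    return fp;
  }

  // Build the 256-entry lookup table for the Rabin polynomial.
  static void initFPTable() {
    FP_TABLE = new long[256];
    for (int i = 0; i < 256; i++) {
      long fp = i;
      for (int j = 0; j < 8; j++)
        fp = (fp >>> 1) ^ (EMPTY & -(fp & 1));
      FP_TABLE[i] = fp;
    }
  }
}
{code}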
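And a minimal Java sketch of the framing proposed above. The class and method names here are illustrative, not a proposed API, and the 0x81 fold is only the hypothetical 4-byte variant floated in the comment:

{code}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class MessageFraming {
  static final byte VERSION_8BYTE_FP = (byte) 0x80;

  // header = version byte + 8-byte little-endian schema fingerprint
  static byte[] frame(long schemaFingerprint, byte[] encodedBody) {
    ByteBuffer buf = ByteBuffer.allocate(9 + encodedBody.length)
        .order(ByteOrder.LITTLE_ENDIAN);
    buf.put(VERSION_8BYTE_FP);
    buf.putLong(schemaFingerprint);
    buf.put(encodedBody);
    return buf.array();
  }

  // Validate the version byte and return the fingerprint; the buffer is
  // left positioned at the start of the encoded body.
  static long readFingerprint(ByteBuffer message) {
    message.order(ByteOrder.LITTLE_ENDIAN);
    byte version = message.get();
    if (version != VERSION_8BYTE_FP)
      throw new IllegalArgumentException("Unrecognized version byte: " + version);
    return message.getLong();
  }

  // Hypothetical 0x81 variant: fold the 64-bit fingerprint to 4 bytes
  // by XORing its two 32-bit halves.
  static int fold32(long fp64) {
    return (int) (fp64 ^ (fp64 >>> 32));
  }
}
{code}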
> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>            Assignee: Niels Basjes
>         Attachments: AVRO-1704-20160410.patch
>
>
> I'm currently using the Datafile format for encoding messages that are written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, meaning that I can read and write data with minimal effort across the various languages in use in my organization. If there were a standardized format for encoding single values that was optimized for out-of-band schema transfer, I would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, e.g. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type).
> 4. An optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode datums in this format, as well as a MessageReader that, given a SchemaStore, would be able to decode datums. The reader would decode the fingerprint and ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed library users to inject custom backends. A simple, filesystem-based one could be provided out of the box.
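A minimal sketch of the SchemaStore hook described in the issue above; the name comes from the description, but the signature is an assumption, not a committed API:

{code}
import org.apache.avro.Schema;

// Pluggable fingerprint -> writer-schema resolver. Implementations could
// be backed by the local filesystem, a registry service, or a cache.
public interface SchemaStore {
  // Return the writer's schema registered under this fingerprint,
  // or null if the fingerprint is unknown to this store.
  Schema findByFingerprint(long fingerprint);
}
{code}

A MessageReader would read the fingerprint from the header, ask its SchemaStore for the writer's schema, and then decode the body with a DatumReader configured with both the writer's and the reader's schemas.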