[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033078#comment-16033078 ]

Sean Busbey commented on AVRO-1704:
-----------------------------------

I think that's because the fix version wasn't properly set when it got closed out. I've updated it to be 1.8.2 now, so it should be in the release notes.

> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>          Components: java, spec
>            Reporter: Daniel Schierbeck
>            Assignee: Niels Basjes
>             Fix For: 1.9.0, 1.8.2
>
>         Attachments: AVRO-1704-20160410.patch, AVRO-1704-2016-05-03-Unfinished.patch, AVRO-1704.3.patch, AVRO-1704.4.patch
>
>
> I'm currently using the Datafile format for encoding messages that are written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, meaning that I can read and write data with minimal effort across the various languages in use in my organization. If there was a standardized format for encoding single values that was optimized for out-of-band schema transfer, I would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type.)
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode datums in this format, as well as a MessageReader that, given a SchemaStore, would be able to decode datums. The reader would decode the fingerprint and ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed library users to inject custom backends. A simple, file system based one could be provided out of the box.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
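The five-part framing and the SchemaStore indirection described in the issue can be sketched as follows. This is an illustrative mock-up of the proposal, not the API that was eventually committed: every class, method, and type-identifier value here is hypothetical, and the optional metadata map is omitted for brevity.

```python
import struct

class SchemaStore:
    """Hypothetical abstract interface; users inject custom backends."""
    def lookup(self, fingerprint: bytes) -> str:
        raise NotImplementedError

class DictSchemaStore(SchemaStore):
    """Trivial in-memory backend (the issue suggests a file-system one)."""
    def __init__(self):
        self._schemas = {}
    def register(self, fingerprint: bytes, schema: str):
        self._schemas[fingerprint] = schema
    def lookup(self, fingerprint: bytes) -> str:
        return self._schemas[fingerprint]

VERSION = 1
FP_RABIN = 0  # fingerprint-type identifier, made up for illustration

def write_message(fingerprint: bytes, datum: bytes) -> bytes:
    """Frame one datum: version, fingerprint type, fingerprint, datum."""
    assert len(fingerprint) == 8  # a Rabin/CRC-64 fingerprint is 8 bytes
    return struct.pack("<BB", VERSION, FP_RABIN) + fingerprint + datum

def read_message(store: SchemaStore, message: bytes):
    """Return (writer_schema, encoded_datum) by resolving the fingerprint."""
    version, fp_type = struct.unpack_from("<BB", message)
    assert version == VERSION and fp_type == FP_RABIN
    fingerprint = message[2:10]
    return store.lookup(fingerprint), message[10:]
```

The key design point is that the reader never sees a schema on the wire; it only sees a fingerprint and asks its injected SchemaStore to resolve it.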
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033068#comment-16033068 ]

Jacob Rideout commented on AVRO-1704:
-------------------------------------

Hmmm ... It looks like it is in the branch-1.8. I am confused since it is NOT listed in https://s.apache.org/avro-release-note-1.8.2
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033065#comment-16033065 ]

Sean Busbey commented on AVRO-1704:
-----------------------------------

Looks like it's in 1.8.2 to me: http://avro.apache.org/docs/1.8.2/spec.html#single_object_encoding
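For reference, the framing the linked 1.8.2 spec standardizes is a two-byte marker C3 01, followed by the 8-byte little-endian CRC-64-AVRO fingerprint of the writer's schema, followed by the Avro binary-encoded body. A minimal sketch of just that framing (the fingerprint computation itself is left to the caller here):

```python
import struct

MAGIC = b"\xc3\x01"  # two-byte single-object encoding marker

def frame(fingerprint: int, body: bytes) -> bytes:
    """Prefix an Avro-binary body with the single-object header."""
    return MAGIC + struct.pack("<Q", fingerprint) + body

def unframe(message: bytes):
    """Split a single-object message back into (fingerprint, body)."""
    if message[:2] != MAGIC:
        raise ValueError("not a single-object encoded message")
    (fingerprint,) = struct.unpack_from("<Q", message, 2)
    return fingerprint, message[10:]
```

Note the fixed 10-byte header: unlike the original five-part proposal, the standardized format has no separate version byte or fingerprint-type byte; the C3 01 marker implies both.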
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033060#comment-16033060 ]

Sean Busbey commented on AVRO-1704:
-----------------------------------

The JIRA is resolved and it was listed as a blocker for 1.8.2. Is it not actually in branch-1.8?
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033034#comment-16033034 ]

Jacob Rideout commented on AVRO-1704:
-------------------------------------

What needs to be done to land this in 1.8.3?
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15712516#comment-15712516 ]

Ryan Blue commented on AVRO-1704:
---------------------------------

You mean erring on the side of caution and using a larger hash? I don't think collisions with a 64-bit fingerprint are likely enough to cause any trouble. And, while you don't calculate the fingerprint every time, you do send it in the message.
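The collision risk discussed above can be quantified with the birthday bound: for n distinct schemas hashed to b bits, the probability of any collision is approximately n(n-1)/2^(b+1). A quick back-of-the-envelope check:

```python
def collision_probability(n: int, bits: int = 64) -> float:
    """Birthday-bound approximation: p ~ n*(n-1) / 2^(bits+1)."""
    return n * (n - 1) / 2 ** (bits + 1)

# Even a million distinct schemas in one organization puts the
# 64-bit collision probability around 3e-8.
p = collision_probability(1_000_000)
```

This supports the point that 64 bits is ample for schema identification, while the 8-byte cost is paid on every message sent.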
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15712508#comment-15712508 ]

Ryan Blue commented on AVRO-1704:
---------------------------------

With a spec like this, we want to be careful about having too many things that must be implemented. I think there would have to be a very good reason to add additional hashes to the spec. If you're interested in using the Avro MessageEncoder and MessageDecoder, then that shouldn't be too difficult because the code is modular enough you can implement a decoder for your message format fairly easily.
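The modularity described in the comment above — plugging in a decoder for a custom wire format alongside the standard one — can be mimicked with a simple dispatch on each message's leading bytes. This is a loose illustration, not the Java MessageDecoder API; the legacy prefix and decoder names are hypothetical:

```python
def make_dispatcher(decoders):
    """decoders: list of (prefix_bytes, decode_fn) pairs; first match wins."""
    def decode(message: bytes):
        for prefix, fn in decoders:
            if message.startswith(prefix):
                return fn(message[len(prefix):])
        raise ValueError("unknown message format")
    return decode

# One handler for the standard single-object marker, one for a
# hypothetical pre-existing in-house format with a 0x00 prefix.
standard = (b"\xc3\x01", lambda body: ("avro-single-object", body))
legacy = (b"\x00", lambda body: ("legacy", body))
decode = make_dispatcher([standard, legacy])
```

This is why the spec itself can stay small: compatibility with other framings is an integration concern, not a spec concern.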
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572449#comment-15572449 ]

radai rosenblatt commented on AVRO-1704:
----------------------------------------

Also, since this is somewhat Kafka related, i would like to point to this kafka proposal for headers in the kafka wire format - https://cwiki.apache.org/confluence/display/KAFKA/KIP-82+-+Add+Record+Headers

discussion thread is here - http://mail-archives.apache.org/mod_mbox/kafka-dev/201609.mbox/%3C1474572662302.81658%40ig.com%3E
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572398#comment-15572398 ]

radai rosenblatt commented on AVRO-1704:
----------------------------------------

At LinkedIn we use a similar scheme for our avro payloads over kafka, but we use a 128bit hash for schema identifier. Would it be possible to still make the hashing scheme changeable to make the transition easier for organizations not using 64bit schema ids?
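The transition concern raised above — organizations already keyed on 128-bit schema IDs — is essentially a question of parameterizing the fingerprint function. A hedged sketch of what a pluggable fingerprint registry might look like (the registry itself is hypothetical; MD5 and SHA-256 stand in for the larger digests the issue description lists, while the standardized encoding itself uses CRC-64-AVRO):

```python
import hashlib

# fingerprint-name -> (digest size in bytes, function over the
# schema's canonical-form bytes).
FINGERPRINTS = {
    "md5": (16, lambda data: hashlib.md5(data).digest()),
    "sha256": (32, lambda data: hashlib.sha256(data).digest()),
}

def fingerprint(name: str, canonical_schema: bytes) -> bytes:
    """Compute the named fingerprint over a canonical schema encoding."""
    size, fn = FINGERPRINTS[name]
    digest = fn(canonical_schema)
    assert len(digest) == size
    return digest
```

The trade-off the thread settles on is that every extra hash in the spec is a mandatory burden on every implementation, so the spec keeps one and leaves alternatives to custom decoders.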
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15463478#comment-15463478 ]

Ryan Blue commented on AVRO-1704:
---------------------------------

Thanks for reviewing!
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15463379#comment-15463379 ]

Sean Busbey commented on AVRO-1704:
-----------------------------------

+1 on AVRO-1704.4.patch
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15461949#comment-15461949 ]

Ryan Blue commented on AVRO-1704:
---------------------------------

I'm marking this as a blocker for the 1.8.2 release because the code is committed. If we release the implementation, I think we should also include the spec changes.
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15461947#comment-15461947 ]

Ryan Blue commented on AVRO-1704:
---------------------------------

[~busbey], could you have a look at the last patch I posted with the spec changes? I'd like to get it into 1.8.2 since the code is. Thank you!
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15391974#comment-15391974 ]

Sean Busbey commented on AVRO-1704:
-----------------------------------

FWIW, I belatedly agree with Doug's statement. Do we have our compatibility promises documented somewhere? I feel like I have a good sense of them, but I don't know if that's just because I've been in the community for several years.
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15391193#comment-15391193 ] Ryan Blue commented on AVRO-1704: -

I just committed the Java implementation, with additional Javadoc. This did not include the incompatible changes, which should be done in a separate issue. I also took the spec from [~nielsbasjes]'s patch and updated it:
* Use "object" instead of "record" to make it clearer that the value doesn't have to be an Avro record
* Use C3 01 for the header
* Simplify the encoding spec as much as possible
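The committed spec frames each message as the two-byte C3 01 marker, then an 8-byte little-endian schema fingerprint, then the binary-encoded payload. A minimal, dependency-free sketch of reading that framing (the class and method names below are illustrative, not part of the Avro API):

```java
// Illustrative helper (not an Avro class) for the single-object framing:
// C3 01, then an 8-byte little-endian schema fingerprint, then the datum.
class SingleObjectHeader {
    static final byte MAGIC_0 = (byte) 0xC3;
    static final byte MAGIC_1 = (byte) 0x01;

    // True when the buffer starts with the C3 01 marker and is long
    // enough to also hold the fingerprint.
    static boolean hasHeader(byte[] msg) {
        return msg.length >= 10 && msg[0] == MAGIC_0 && msg[1] == MAGIC_1;
    }

    // Reads the little-endian fingerprint from bytes 2..9.
    static long fingerprint(byte[] msg) {
        long fp = 0;
        for (int i = 9; i >= 2; i--) {
            fp = (fp << 8) | (msg[i] & 0xFFL);
        }
        return fp;
    }
}
```

A real decoder would use the fingerprint to look up the writer's schema in a SchemaStore before decoding the remaining bytes.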
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15391117#comment-15391117 ] Ryan Blue commented on AVRO-1704: -

Sounds good to me! I'll fix the missing Javadoc and remove that change.
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15391091#comment-15391091 ] Doug Cutting commented on AVRO-1704:

We don't promise source-compatibility for minor Avro releases, but we do for dot releases. So this should not go into 1.8.x but could go into 1.9.0. (An incompatible change to data formats would require a 2.0.)
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390333#comment-15390333 ] Ryan Blue commented on AVRO-1704: -

For the createDatumReader/Writer change: it is [binary compatible because of type erasure|https://docs.oracle.com/javase/specs/jls/se7/html/jls-13.html#jls-13.4.13], but not source compatible. To work around not being able to construct these with the type parameter, some users cast to the right type, like this:

{code:lang=java}
DatumReader<GenericRecord> reader =
    (DatumReader<GenericRecord>) GenericData.get().createDatumReader(schema);
{code}

That compiles in 1.8.1 because it casts the raw {{DatumReader}} returned by {{createDatumReader}} to {{DatumReader<GenericRecord>}}, but not with this change. After the change, the method returns a parameterized {{DatumReader}} that Java won't convert with a cast. The fix is to remove the cast, and then Java correctly infers that the type parameter is {{GenericRecord}} instead of {{Object}}.

Do we guarantee source compatibility? Even if we do not, [~busbey], what do you think about including this incompatibility?
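The erasure point can be illustrated without Avro at all. The {{Reader}}/{{Factory}} names below are hypothetical stand-ins for DatumReader and GenericData: a raw-returning factory forces callers to cast, while a generic factory method lets the caller's target type drive inference so the cast can be dropped.

```java
// Hypothetical stand-ins for DatumReader/GenericData; not Avro classes.
interface Reader<D> {
    D read();
}

class Factory {
    // Old style: returns a raw Reader, so callers cast the result to
    // the element type they want.
    @SuppressWarnings("rawtypes")
    public Reader createReaderOld() {
        Reader<Object> r = () -> "datum";
        return r;
    }

    // New style: a generic method. The caller's target type lets javac
    // infer D, making the old explicit cast unnecessary.
    @SuppressWarnings("unchecked")
    public <D> Reader<D> createReaderNew() {
        return (Reader<D>) createReaderOld();
    }
}
```

Both shapes produce the same bytecode-level method after erasure, which is why the change is binary compatible even though old casting code may stop compiling.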
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389787#comment-15389787 ] Doug Cutting commented on AVRO-1704:

+1 overall. Two minor questions:
- Is the change to the createDatumReader/Writer API fully back-compatible?
- I think a few of the new public methods don't have javadoc.

It's probably worth building the docs and glancing through them to see how they look. That usually inspires a lot of improvements and is especially useful with new APIs like this. Other than that, LGTM.
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387031#comment-15387031 ] Ryan Blue commented on AVRO-1704: -

I agree with your reasoning on naming, so let's go with MessageEncoder. I think that's reasonably distinct from the other classes.

By my builder comment, I meant that if we want to make it easier to instantiate a MessageDecoder, we could add a builder rather than a factory method. That would make it easy to seed the decoder with compatible schemas and select the GenericData subclass. Something like this:

{code:lang=java}
MessageDecoder<MyDatum> decoder = MessageDecoder.builder()
    .read(MyDatum.class)
    .schema(oldSchema1)
    .schema(oldSchema2)
    .build();
{code}

I don't think this is needed yet, since the constructors are fairly simple.

I think the implementation is about ready to commit, followed by the spec updated for the 2-byte header used in this implementation. Is there anything else you think we should change?
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15382696#comment-15382696 ] Doug Cutting commented on AVRO-1704:

I doubt we'll ever need this abstracted, and having it so might encourage a proliferation of message formats, but it might also prove useful someday, so I can live with that. However, I still don't see the need for multiple levels of abstraction (interface + abstract base class). That still seems like vast overkill to me, but is probably not worth fighting about.

As far as terminology, "datum" is used to refer to the in-memory data structure (generic, specific, reflect, Thrift, protobuf) while "encode/decode" refer to specific serialized formats (binary, json). A reader/writer translates between the in-memory structure and the abstract encoding API. So where does "message" fit into this taxonomy? I suppose it's a new serialized format, an extension of "binary", so "encode/decode" are probably more appropriate than "read/write".

Not sure what you mean about the builder.
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15381561#comment-15381561 ] Ryan Blue commented on AVRO-1704: -

I think this should be abstract. The format that we're adding solves one set of uses, but the utility methods have value beyond that. Encoding a single Avro record is fairly common, but the implementations vary widely in quality because it is difficult to find the right setup of DatumWriter, BinaryEncoder, and ByteArrayOutputStream. Simplifying and improving applications that already do this is a good thing. And some of those uses, like the case I mentioned where we're embedding Avro in Parquet records, don't need the header or schema at all because that's defined in the file metadata.

The abstraction is also useful for transitioning to the format we're defining here. The normal way to encode messages in Kafka is the 8-byte fingerprint followed by the encoded message payload. With the abstraction, you can write a decoder that checks for the header and then deserializes, or assumes the old format if the header is missing. That would enable rolling upgrades using the same Kafka topics, rather than needing a hard transition. I would also include the abstraction in case we want to change or introduce a new format later.

bq. I also worry that names like BinaryDatumDecoder are confusing

I've pushed a new commit that moves the classes to org.apache.avro.message and renames them to MessageEncoder and MessageDecoder. I used "encoder" instead of "reader" to contrast with DatumReader and DatumWriter, since there is little difference between a datum and a message (a message is a datum encoded by itself).

bq. Perhaps [the reusable i/o streams] should go in the util package so they can be used more widely?

I've moved them there. I avoided it before so that they weren't added to the public API, but I think it's fine to make them available.

bq. We might also add utilities for generic & reflect, like model#getMessageWriter(Schema)?

I looked at this, but then the GenericData classes would have both createDatumWriter and getMessageWriter, which looks confusing to me. Keeping the MessageEncoder above the level of the data models helps separate the DatumWriter from the MessageEncoder. If we want to make instantiating these easier, then maybe a builder would be more appropriate. That would allow us to pass multiple writer schemas to the MessageDecoder.
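The rolling-upgrade idea (try the new framed format, fall back to the bare-fingerprint convention) can be sketched without the Avro classes. DualFormatDecoder is a hypothetical name; the "legacy" framing below is the 8-byte-fingerprint Kafka convention described in the comment:

```java
import java.util.Arrays;

// Hypothetical helper, not part of the committed API: strips whichever
// framing is present and returns the encoded datum bytes.
class DualFormatDecoder {
    static byte[] payload(byte[] msg) {
        // New format: C3 01 marker + 8-byte fingerprint + payload.
        if (msg.length >= 10 && msg[0] == (byte) 0xC3 && msg[1] == (byte) 0x01) {
            return Arrays.copyOfRange(msg, 10, msg.length);
        }
        // Legacy Kafka convention: 8-byte fingerprint + payload.
        return Arrays.copyOfRange(msg, 8, msg.length);
    }
}
```

Because the C3 01 marker never begins a legacy message's fingerprint in practice, a decoder like this lets producers and consumers upgrade independently on the same topic.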
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15371389#comment-15371389 ] Doug Cutting commented on AVRO-1704:

I don't see why anyone would prefer the interface to the abstract base class. It seems like belt and suspenders (https://youtube.com/watch?v=VuWzeoIr7J4). Who do we imagine would implement this outside of the project?

Frankly, I question whether this even needs to be abstract. Applications will use this API because they want to use Avro's tagged binary encoding for messages. Applications that want an untagged binary encoding can use the existing APIs. The in-memory format is already abstracted, and the encoding is fixed. What we're providing here isn't an extensible framework, it's some utility code. Folks who seek to optimize away the 10-byte overhead can use a DatumWriter & BinaryEncoder as they do today. That's an unsafe encoding and we needn't further simplify it. Our goal is to provide an easy-to-use, safe, standard encoding for messages.

I also worry that names like BinaryDatumDecoder are confusing when we already have BinaryDecoder and DatumReader. We might instead call a so-prefixed binary-encoded datum a "message", and have MessageWriter and MessageReader classes that implement this and a MessageSchemaStore, perhaps even placing these all in a new "message" package.

I won't reject this patch over these differences in style. I prefer not to hide things behind abstractions until there's clear need. At that point, when multiple implementations are required, one has a better idea of what the abstraction should be. In the mean time, the code is substantially smaller, easier to read, debug, maintain, etc. But this is a style issue where reasonable folks might differ.

It's hard to believe we don't already have reusable array i/o streams around! Perhaps these should go in the util package so they can be used more widely?

I like the convenience methods generated for specific data. We might also add utilities for generic & reflect, like model#getMessageWriter(Schema)?
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15369379#comment-15369379 ] Ryan Blue commented on AVRO-1704: -

Forgot to add: I've kept the new commits separate so you can see what changed. I'll squash them into the implementation when it is time to commit to master, if this implementation is accepted.
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15369378#comment-15369378 ] Ryan Blue commented on AVRO-1704: -

[~cutting], I've pushed a couple of new commits to the pull request. The changes include:
* Add ReusableByteBufferInputStream and ReusableByteArrayInputStream
* Make the encoder and decoder instances thread-safe
* Remove the thread-local encoder from Specific because the static encoder and decoder are now thread-safe
* Add tests using generic

That addresses the review feedback other than the question of whether to use an interface or an abstract class. I think the patch has the best of both options by including both an interface and an abstract base class (DatumDecoder.BaseDecoder) that implementations can use to cut down on boilerplate and maintain compatibility. That leaves the choice up to the implementer. If you have a strong opinion here, I can change it, but I think having both is a good solution.

Also, some of the tests are ignored because they don't pass without a modification to the ResolvingGrammarGenerator. Aliases don't appear to be working. I'm opening another issue with a patch for it.
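The reusable-stream idea is straightforward to sketch in plain Java. The class below only assumes the behavior implied by the name in the commit list (swap in a new buffer without allocating a new stream per message), not the exact committed implementation:

```java
import java.io.ByteArrayInputStream;

// Sketch of a reusable byte-array input stream: ByteArrayInputStream's
// buf/pos/mark/count fields are protected, so a subclass can repoint
// the stream at a new buffer between messages.
class ReusableByteArrayInputStream extends ByteArrayInputStream {
    ReusableByteArrayInputStream() {
        super(new byte[0]);
    }

    // Repoint the stream at data[offset, offset + length).
    void setByteArray(byte[] data, int offset, int length) {
        this.buf = data;
        this.pos = offset;
        this.mark = offset;
        this.count = Math.min(offset + length, data.length);
    }
}
```

A decoder holding one of these (per thread) avoids a ByteArrayInputStream allocation for every decoded message.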
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353906#comment-15353906 ]

Doug Cutting commented on AVRO-1704:

I think all the methods are useful, but some of them (e.g., non-reuse) will always be implemented by boilerplate and are thus not core to the interface, but rather something more suitable for a base class. An abstract base class would still permit independent alternative implementations. The only additional power an interface has is that one can implement multiple interfaces. But interfaces don't let you implement convenience methods, nor do they permit compatible evolution: if you ever add or remove a method, you break implementations, because you cannot provide default impls.

But if you feel multiple inheritance is important here, then it's probably easier to stick with an interface than, e.g., refactor into encoder/decoder provider classes that are separate from the user-invoked classes, or some other way to avoid such boilerplate implementations.

Encoding to a ByteBuffer should be thread-safe, since it has no caller-visible state, no?

> Standardized format for encoding messages with Avro
> ---
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
> Issue Type: Improvement
> Reporter: Daniel Schierbeck
> Assignee: Niels Basjes
> Attachments: AVRO-1704-2016-05-03-Unfinished.patch, AVRO-1704-20160410.patch
>
> I'm currently using the Datafile format for encoding messages that are written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, meaning that I can read and write data with minimal effort across the various languages in use in my organization. If there was a standardized format for encoding single values that was optimized for out-of-band schema transfer, I would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type).
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode datums in this format, as well as a MessageReader that, given a SchemaStore, would be able to decode datums. The reader would decode the fingerprint and ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed library users to inject custom backends. A simple, file system based one could be provided out of the box.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
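Doug's base-class argument above can be sketched roughly like this: one core method that subclasses implement, with the boilerplate convenience overload written once in the base class. All names here are illustrative, not Avro's actual API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class EncoderBaseSketch {

  abstract static class BaseMessageEncoder<D> {
    // Core method: subclasses implement only this.
    abstract void encode(D datum, OutputStream out) throws IOException;

    // Boilerplate convenience overload lives in the base class, so more
    // overloads can be added later without breaking existing subclasses.
    ByteBuffer encode(D datum) throws IOException {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      encode(datum, buffer);
      return ByteBuffer.wrap(buffer.toByteArray());
    }
  }

  // Minimal subclass used to exercise the inherited convenience method.
  static class Utf8Encoder extends BaseMessageEncoder<String> {
    @Override
    void encode(String datum, OutputStream out) throws IOException {
      out.write(datum.getBytes(StandardCharsets.UTF_8));
    }
  }
}
```

The subclass only supplies the stream variant; the ByteBuffer variant comes for free, which is exactly the "boilerplate in a base class" trade-off being discussed.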
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353737#comment-15353737 ]

Ryan Blue commented on AVRO-1704:

I agree that the current interface is wide. I think we should have the datum reuse methods, which doubles the API. I think we definitely want the ByteBuffer methods. Do you think we don't need the InputStream methods? In the pull request there are also byte array methods, but it's easy for callers to use ByteBuffer instead.

I like having the interface so that alternative implementations can be independent. There's no guarantee that Avro's base class is useful to implementers, and I don't see a need to force people to inherit from an Avro class when it may not make sense. There's an optional base class for convenience, so I think the benefits outweigh the cost.

+1 for getting rid of the performance pitfalls. I think we just need to find a reusable ByteArrayInputStream and make sure we can change the buffer list in ByteBufferInputStream. I'll look into it. For thread safety we can just make the reused state thread-local like you suggest. Right now the Specific methods use a thread-local DatumEncoder/DatumDecoder. Do you think the DatumEncoder implementations should be thread-safe?

I think we do need the raw format. Right now there are a lot of systems already serializing Avro records in the equivalent of the raw format, so I would like to have an Avro class that helps move to the new spec. Also, if the schema is fixed then there's no need for 10 extra bytes per payload, so it is independently useful. For example, I use the raw format to store JSON payloads. The schema won't change, and Avro is much smaller and faster.
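The thread-local reuse Ryan describes can be sketched like this: each thread keeps and resets its own buffer instead of allocating one per call. Names are illustrative, not the actual Avro implementation.

```java
import java.io.ByteArrayOutputStream;

public class ThreadLocalReuseSketch {

  // One buffer per thread; no shared mutable state, so calls are thread-safe.
  private static final ThreadLocal<ByteArrayOutputStream> BUFFER =
      ThreadLocal.withInitial(ByteArrayOutputStream::new);

  public static byte[] encode(byte[] payload) {
    ByteArrayOutputStream out = BUFFER.get();
    out.reset();                           // reuse: clear previous contents
    out.write(payload, 0, payload.length); // stand-in for real Avro encoding
    return out.toByteArray();
  }
}
```

The cost of the reuse is that the per-thread buffer never shrinks, which is the usual trade-off with thread-local scratch state.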
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353614#comment-15353614 ]

Doug Cutting commented on AVRO-1704:

That decoder interface seems particularly wide. Might these be better as base classes rather than interfaces? What power does the interface add?

The initial implementations also have hidden performance pitfalls; some operations allocate streams & arrays for every call. We might either go with a lean-and-mean API, or make sure that all of the supported invocations are efficient. I'd prefer inefficiencies be manifest, forcing clients to allocate streams per call rather than folks assuming they're using a ByteBuffer-optimized API. To optimize these in a thread-safe manner I think we'd add a ThreadLocal field, right?

Do we really need the raw format support? This is supported by the existing API. The primary goal here is to add support for a new, non-raw "message" format. Without the interface & the raw format, this could become just two utility classes, MessageEncoder and MessageDecoder. Is that too reductive?
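The "just two utility classes" alternative Doug floats could look something like this: no interface hierarchy, one concrete encoder and one concrete decoder with a minimal surface. String stands in for an Avro datum so the sketch is self-contained; all names are illustrative.

```java
import java.nio.charset.StandardCharsets;

public class LeanApiSketch {

  // Concrete encoder: one obvious entry point, no inheritance required.
  static class MessageEncoder {
    byte[] encode(String datum) {
      return datum.getBytes(StandardCharsets.UTF_8);
    }
  }

  // Concrete decoder: the mirror image of the encoder.
  static class MessageDecoder {
    String decode(byte[] message) {
      return new String(message, StandardCharsets.UTF_8);
    }
  }
}
```

The appeal is that every supported invocation is visible and cheap; the cost, as the thread discusses, is losing independent alternative implementations behind a shared interface.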
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352355#comment-15352355 ]

Ryan Blue commented on AVRO-1704:

[~nielsbasjes], sorry it's taken so long for me to get back to you on this.

On the spec:
* I think we should go with the header 0xC3 0x01. The first byte makes it easily recognizable, as you suggest, and meets my requirement of minimizing the number of non-Avro payloads that match. Using 0x01 makes it easy to see the version and will prevent programs confusing payloads with text, as Doug suggests.
* I don't see much value in reserving space in the second byte. I don't think there will be many formats for serializing Avro payloads, and I don't think we will have problems with collision.

I've had a look at your patch and there's a lot in there: an update to the spec, an implementation, an XOR demo, changes to Schema hashing, specific support, and static default classes. I think it would be helpful to get this in by breaking up the work into separate patches, pull requests, or issues.

I also think we should simplify the API a bit. I'd like to keep it small and grow it as we need, to keep maintenance and compatibility simple. For example, SchemaStorage has open and close methods that are only used in a test. I'd rather not add life-cycle methods like those unless the life-cycle of a SchemaStorage needs to be managed by Avro. To that end, I propose the following API:

{code:lang=java}
interface SchemaStore {
  Schema findByFingerprint(long fingerprint);
}
{code}

I also think that the message API should be focused around a datum and a buffer or stream. The data model (GenericData instance) and other things can be passed in to create it and then reused for efficiency.
I've actually implemented this already for a project that stores Avro-encoded payloads in Parquet, so I've [adapted that implementation|https://github.com/apache/avro/pull/103] to look up fingerprints from a SchemaStore. The API is broken into encoder and decoder sides to deal with separate concerns: for the encoder that's how to manage buffers, and for the decoder it's how to resolve schemas and reuse datums.

{code:lang=java}
interface DatumEncoder<D> {
  // constructed as: DatumEncoder(GenericData model, Schema schema, boolean copyBuffer)
  ByteBuffer encode(D datum); // if copyBuffer was true, this is a new buffer
  void encode(D datum, OutputStream stream);
}

interface DatumDecoder<D> {
  // constructed as: DatumDecoder(GenericData model, Schema schema, SchemaStore store)
  D decode(ByteBuffer buffer);
  D decode(ByteBuffer buffer, D reuseDatum);
  D decode(InputStream stream);
  D decode(InputStream stream, D reuseDatum);
}
{code}

My branch is broken into a few commits. The first two are bug fixes, but the third is [the DatumEncoder implementation, d91b905|https://github.com/apache/avro/pull/103/commits/d91b90544f4486a72da8d3ff5b81dfc3c79d7c2f], and the fourth is [support for the Specific data model, 7fa75aa|https://github.com/apache/avro/pull/103/commits/7fa75aab405c6460077d7cc7e403c664cce84431], based on your patch.

I'd like to hear what you think of the DatumEncoder API in that branch. It implements a few things that I think we'll need, like datum reuse, and it reuses encoders, DatumWriters, and buffers. It implements two encoder/decoder pairs: "raw", which is just the datum bytes, and "binary", which implements the header and schema lookup. It definitely needs some improvements, like more thorough tests and better naming (e.g., Doug's suggestion to use "message").
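A minimal in-memory implementation of the single-method SchemaStore Ryan proposes might look like this. String stands in for org.apache.avro.Schema so the sketch is self-contained; MapSchemaStore and its add() helper are hypothetical names, not part of Avro.

```java
import java.util.concurrent.ConcurrentHashMap;

public class SchemaStoreSketch {

  // The proposed one-method interface, with String standing in for Schema.
  interface SchemaStore {
    String findByFingerprint(long fingerprint);
  }

  // Hypothetical map-backed store for tests or single-process use.
  static class MapSchemaStore implements SchemaStore {
    private final ConcurrentHashMap<Long, String> byFingerprint =
        new ConcurrentHashMap<>();

    void add(long fingerprint, String schema) {
      byFingerprint.put(fingerprint, schema);
    }

    @Override
    public String findByFingerprint(long fingerprint) {
      return byFingerprint.get(fingerprint); // null when the schema is unknown
    }
  }
}
```

Note how small the contract is: no open/close life-cycle methods, which is exactly the simplification being argued for.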
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352338#comment-15352338 ]

ASF GitHub Bot commented on AVRO-1704:

GitHub user rdblue opened a pull request: https://github.com/apache/avro/pull/103 (AVRO-1704: Add DatumEncoder API)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rdblue/avro AVRO-1704-add-datum-encoder-decoder

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/avro/pull/103.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #103

commit 79a2993151ea7589c06b854ee7ac8e951816ecce
Author: Ryan Blue
Date: 2016-06-28T03:37:56Z
AVRO-1869: Java: Fix Decimal conversion from ByteBuffer.

commit 3ca6a15ddf75e4c39468ddd1d454331f3f54f1e3
Author: Ryan Blue
Date: 2016-06-28T03:40:14Z
AVRO-1704: Java: Add type parameter to createDatumReader and Writer.

commit d91b90544f4486a72da8d3ff5b81dfc3c79d7c2f
Author: Ryan Blue
Date: 2016-06-28T03:41:40Z
AVRO-1704: Java: Add DatumEncoder and SchemaStore.

commit 7fa75aab405c6460077d7cc7e403c664cce84431
Author: Ryan Blue
Date: 2016-06-28T03:44:06Z
AVRO-1704: Java: Add toByteArray and fromByteArray to specific.
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273205#comment-15273205 ]

Niels Basjes commented on AVRO-1704:

Thanks for the great feedback. I'm going to work on these points.
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271360#comment-15271360 ]

Doug Cutting commented on AVRO-1704:

A few more comments:

- I think we can move all of the SpecificRecord#toBytes() and #fromBytes() code to SpecificRecordBase instead of generating it for each class. I prefer to minimize generated code. This might look like:

{code}
public class SpecificRecordBase<T> {
  ...
  public T fromBytes(byte[] bytes) { return (T) ...; }
}

public class Player extends SpecificRecordBase<Player> {
  ...
}
{code}

- I suspect using DataInputStream and DataOutputStream in public APIs may be problematic for performance long-term. Maybe the only public API in the first version should be 'T fromMessage(byte[])' and 'byte[] toMessage(T)'? This can then be optimized, and, if needed, a higher-performance lower-level API can be added.

- We should implement this API for more than just specific data. This should work for generic data, Thrift, protobuf, etc., producing an identical format. So the base implementation should be passed a GenericData, which all of these inherit from, since it can create an appropriate DatumReader or DatumWriter. So this might look something like:

{code}
package org.apache.avro.data;

public class MessageCoder<T> {
  private GenericData data;

  public MessageCoder(GenericData data, MessageSchemaRepo repo) {
    this.data = data;
  }

  public byte[] toMessage(T object) { ... }
  public T fromMessage(byte[] bytes) { ... }
}
{code}

- Permitting alternate schema repos and alternate in-memory object representations is important, but supporting alternate message formats is not. The goal here is to standardize a message format, so I would not design things for extensibility on that axis.
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271178#comment-15271178 ]

Doug Cutting commented on AVRO-1704:

A few quick comments:

- A prefix with non-printing characters has the benefit of making it clear this is binary data and should not be treated as text. This may or may not matter here, but, for example, it is useful that there are non-printing characters at the start of a data file so that applications don't ever guess that these are text and subject to CRLF manipulation, etc. Or, if instead we want it to be printable, we should perhaps just use standard ASCII 'A' and '>'. I don't see the advantage of using 'rare' printing characters; that just seems confusing to me.
- The changes to Schema#hashCode() may have performance implications, so we should at least run the Perf.java benchmarks before this is committed.
- getFingerprint() needs javadoc.
- invalidateHashes() is package-private; it should be private.
- SingleRecordSerializer is specific to SpecificRecord, so perhaps belongs in the specific package?
- Is this really for records only, or for any object?
- Maybe the base class/interface should be called MessageEncoder instead of RecordSerializer, the package could be named 'message', and the storage could be called MessageSchemaRepo?
- The Xor example should be in a test package, not in the released library, no?
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15268928#comment-15268928 ]

Ryan Blue commented on AVRO-1704:

Yeah, sorry about not replying yet. I haven't gotten a good chance for a review. My current thought is that I'm fine with 2 bytes and 0xC3. It seems strange to me to pick an arbitrary byte for the version; maybe it would be better to go with 0x00. Also, I have some code that I've been using that I want to compare with what you have here, and to think about the API, since it will be a popular one.
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254092#comment-15254092 ]

Niels Basjes commented on AVRO-1704:

Question: what would be the preferred way of handling error situations like:
* an unknown schema fingerprint
* a bad set of bytes (in various forms)

I see at least two general directions:
# Return null
# Throw an exception

Which is preferred in this case? Which is 'better' for the application developers?
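The exception-based option Niels asks about could be sketched like this: fail loudly with a dedicated unchecked exception instead of returning null, so callers cannot silently ignore a missing schema. The exception class and method names are hypothetical, not part of Avro; String stands in for an Avro Schema to keep the sketch self-contained.

```java
import java.util.Map;

public class DecodeErrorSketch {

  // Hypothetical exception carrying the offending fingerprint in its message.
  static class MissingSchemaException extends RuntimeException {
    MissingSchemaException(long fingerprint) {
      super("No schema registered for fingerprint " + fingerprint);
    }
  }

  // Look up a schema; throw rather than return null on a miss.
  static String findOrThrow(Map<Long, String> store, long fingerprint) {
    String schema = store.get(fingerprint);
    if (schema == null) {
      throw new MissingSchemaException(fingerprint);
    }
    return schema;
  }
}
```

An unchecked exception keeps decode signatures clean while still making the failure mode explicit, whereas a null return pushes a mandatory check onto every caller.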
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247442#comment-15247442 ]

Niels Basjes commented on AVRO-1704:

I agree with what you are saying. So the header should be shorter, but not too short. I think that having only 1 byte is too short; 2 bytes should be fine: 1 marker byte, 1 body version byte. So the updated proposal becomes:

* The header becomes 2 bytes in total: 'Ã' followed by a version byte.
** I chose the 'Ã' (0xC3) because:
*** It is a 'human readable character'.
*** It looks like an 'A' (from Avro) under a 'wave', and since currently the primary use case is streaming, this seems like the right marker.
*** It is also a very uncommon character, so if we see it the collision probability drops dramatically.
** The version byte can be any byte that essentially defines the record structure that follows. This can be used to indicate, for example, the difference between a normal record and an encrypted record.
*** I think that we should also pick an 'uncommon' byte to mark the default record version. I think this one is a good candidate: '»' (0xBB), because it looks like a symbol for 'fast'.
* The default body (i.e. version 0xBB) becomes:
** body: fingerprint record
*** fingerprint = CRC-64-AVRO(normalized schema) (8 bytes, little endian)
*** record = encoded Avro bytes using schema

So the overall record using the default body structure would look like this:

{code}
message = header body
header  = 'Ã' '»' (== 0xC3 0xBB)
body    = fingerprint record
{code}

In the generated code I'll see what can be done to make both the header and body code 'pluggable'. I think that the Schema Storage should get a capped 'cache' (LRU?) that retains the fingerprints that are 'known to not exist'.
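The framing proposed above can be sketched as a few ByteBuffer puts: the 2-byte header (0xC3 marker, 0xBB version in this proposal; note the final spec may differ), then the 8-byte schema fingerprint little-endian, then the encoded datum bytes. The class and method names are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class FramingSketch {

  public static byte[] frame(long fingerprint, byte[] encodedDatum) {
    ByteBuffer buf = ByteBuffer.allocate(2 + 8 + encodedDatum.length)
        .order(ByteOrder.LITTLE_ENDIAN);
    buf.put((byte) 0xC3);      // marker byte ('Ã')
    buf.put((byte) 0xBB);      // body version byte ('»')
    buf.putLong(fingerprint);  // CRC-64-AVRO fingerprint, little endian
    buf.put(encodedDatum);     // Avro binary-encoded record
    return buf.array();
  }
}
```

Setting the buffer order once means putLong emits the fingerprint little-endian, matching the proposal's byte order.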
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244991#comment-15244991 ] Ryan Blue commented on AVRO-1704: - Sorry if what I said wasn't clear. I'm not proposing that we get rid of the header; I'm saying that we make it one byte instead of 4. I think what I outlined addresses the case where a schema cache miss is expensive and balances that with the per-message overhead. (I'm fine moving forward with the FP considered part of the body.) A one-byte header results in lower than a 1/256 chance of an expensive lookup (by choosing the byte carefully). Why is that too high? And why 4 bytes and not, for example, 2 for a 1/65536 chance? I disagree that the impact of extra bytes is too small to matter. It (probably) won't cause fragmentation when sending one message, but we're not talking about just one message. Kafka's performance depends on batching records together for network operations, and each message takes up space on disk. What matters is the percentage of data that is overhead: 4 bytes on 500-byte messages is 0.8%, and on 100-byte messages it is 4%. In terms of how much older data I can keep in a Kafka topic, that accounts for 11m 30s to 57m 30s per day. If I provision for a 3-day window of data in Kafka, I'm losing between half an hour and 3 hours of that just to store 'Avr0' over and over. That's why I think we have to strike a balance between the two concerns. 1 or 2 bytes should really be sufficient, depending on the false-positive probability we want. And false positives are only that costly if each one causes an RPC, which we can avoid with a little failure-detection logic.
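The retention figures above follow directly from the overhead percentages; a back-of-the-envelope check (message sizes of 100 and 500 bytes are the assumed cases from the comment):

```java
// Verify that 0.8% and 4% overhead translate to roughly 11.5 and 57.6
// minutes of a day's retention window spent storing the 4-byte prefix.
public class OverheadMath {
    public static void main(String[] args) {
        double minutesPerDay = 24 * 60;
        double overhead500 = 4.0 / 500;   // 0.8% of each 500-byte message
        double overhead100 = 4.0 / 100;   // 4% of each 100-byte message
        System.out.printf("500-byte msgs: %.1f min/day%n", overhead500 * minutesPerDay);
        System.out.printf("100-byte msgs: %.1f min/day%n", overhead100 * minutesPerDay);
    }
}
```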
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244947#comment-15244947 ] Niels Basjes commented on AVRO-1704: A few of the thoughts I had when creating the current patch: # Regarding the 'Avro' header (which I still believe to be 'the way to go') #* The cost of going to the Schema registry is high on a cache miss. Problems like the one I ran into with STORM-512 will occur in other systems too and may very well cause an overload on the schema registry. #* I consider the cost of a fixed header of 4 bytes to be low. But that really depends on the size of the record being transmitted (my records are in the 500-1000 byte range). #** These extra bytes will only be persisted in streaming systems like Kafka. Long-term file formats (like AVRO, Parquet and ORC) won't store this. #** In network traffic the overhead is 'unmeasurably small' because it is unlikely these 4 bytes will push the record over the size of a single TCP packet (1500 bytes). # Regarding the schema fingerprint (which I consider a 'body' part). #* The idea of the 'version' was that someone may want to use a different hash instead of the CRC-64-AVRO. #* I think that in case of encryption we should have the fingerprint encrypted too. *In light of the encryption option and your comments I'm now considering this _brainwave_*: * The 'header of the message' should be pluggable. ** The default is a 'fixed shape' which includes a format id (same as what my current patch does). ** I expect that making this pluggable too is possible, but that would have some restrictions, like "all records of a schema must adhere to the same base format". * The 'body of the message' should be pluggable too. ** Format '0' is hardcoded (fingerprint+record). ** Other versions (we should define a range like 0x80-0xFF) can be used by anyone to define a custom body definition (including encryption). I expect these versions to only exist within a specific company.
If they need to exchange data with others they should share their format specification anyway. * If we set the code up right we can have a layering system: i.e. someone can 'insert' an encryption layer and still use the 'standard' body (after decryption). ** Such an 'encryption layer' would add additional parts like an encryption type and a key id.
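A hypothetical sketch of what the pluggable layering described above could look like. None of these types exist in the patch; the names are illustrative, and the 'encryption' is an identity placeholder standing in for a real cipher plus key id.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical interface for the pluggable-body idea; illustrative only.
interface BodyCodec {
    int formatId();                                  // e.g. 0x00 = fingerprint + record
    byte[] encode(long fingerprint, byte[] record);
    byte[] decode(byte[] body);                      // returns the record bytes
}

// The hardcoded format '0': little-endian fingerprint followed by the record.
class DefaultCodec implements BodyCodec {
    public int formatId() { return 0x00; }
    public byte[] encode(long fingerprint, byte[] record) {
        ByteBuffer buf = ByteBuffer.allocate(8 + record.length).order(ByteOrder.LITTLE_ENDIAN);
        return buf.putLong(fingerprint).put(record).array();
    }
    public byte[] decode(byte[] body) {
        return Arrays.copyOfRange(body, 8, body.length);
    }
}

// An 'encryption layer' wrapping another codec and reusing it after decryption.
class EncryptingCodec implements BodyCodec {
    private final BodyCodec inner;
    EncryptingCodec(BodyCodec inner) { this.inner = inner; }
    public int formatId() { return 0x80; }           // custom range 0x80-0xFF
    public byte[] encode(long fp, byte[] record) { return encrypt(inner.encode(fp, record)); }
    public byte[] decode(byte[] body) { return inner.decode(decrypt(body)); }
    private byte[] encrypt(byte[] b) { return b; }   // placeholder: real cipher + key id go here
    private byte[] decrypt(byte[] b) { return b; }
}
```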
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244347#comment-15244347 ] Ryan Blue commented on AVRO-1704: - Looks like I was a little too optimistic about time to review things this week. Sorry for the delay. I think we're close to a spec. Here are some additional thoughts. Looks like everyone is for using the CRC-64-AVRO fingerprint, which is good because it can be implemented in each language and doesn't require a library dependency. That's also what's often used in practice. +1 for an interface in Avro that lets you plug in a schema resolver. I think the fingerprint should be considered part of the header rather than the body. It's a small distinction, but the fingerprint is a proxy for the schema here and the body/payload depends on it. The schema is in the container file header, so this is consistent. I want to avoid a 4-byte sentinel value in each message. There are two uses for it: to make sure the message is Avro, and to communicate the format version should we want to change it later. Because the schema fingerprint is included in the message, it is very unlikely that unknown payloads will be read as Avro messages, since that requires a collision with an 8-byte schema fingerprint. I think that's plenty of protection from passing along corrupt data. The concern that doesn't address is what happens when a fingerprint is unknown, which in a lot of cases will cause a REST call to resolve it. I don't think adding 4 bytes to every encoded payload is worth avoiding this case when the lookup can detect some number of failures and stop making the RPC calls. I just don't think we should design the format around a solvable problem like that. I think the second use, versioning the format, is a good idea. That only requires one byte, and including that byte can also serve as a way to detect non-Avro payloads, just with a higher probability of collision. I think that's a reasonable compromise.
There would be something like a 1/256 chance that the first byte collides, assuming that byte is random in the non-Avro payload. That dramatically reduces the problem of making RPC calls to resolve unknown schema FPs. We want to choose the version byte carefully because other formats could easily have 0x00, 0x01, or an ASCII character there. I propose a version number with the MSB set: 0x80. That's unlikely to conflict with a flags byte, the first byte of a number, or the first character of a string. That makes the format: {code} message = header body header = 0x80 CRC-64-AVRO(schema) (8 bytes, little endian) body = encoded Avro bytes using schema {code} We could additionally have a format with a 4-byte FP, version 0x81, if anyone is interested in it. Something simple like XOR the first 4 bytes with the second 4 bytes of the CRC-64-AVRO fingerprint. 8 bytes just seems like a lot when this gets scaled up to billions of records. One last thought: in the implementation, it would be nice to allow skipping the version byte, because a lot of people have already implemented this as CRC-64-AVRO + encoded bytes. That would make the Avro implementation compatible with existing data flows and increase the chances that we can move to this standard format.
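The 4-byte variant suggested above ("XOR the first 4 bytes with the second 4 bytes") could be folded like this. Which half counts as "first" is an open detail of the proposal, so this is just one possible interpretation:

```java
public class FingerprintFold {
    // Fold the 64-bit CRC-64-AVRO fingerprint to 32 bits by XORing the
    // high half into the low half (one possible reading of the proposal).
    static int fold32(long fp64) {
        return (int) (fp64 ^ (fp64 >>> 32));
    }
}
```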
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238996#comment-15238996 ] Niels Basjes commented on AVRO-1704: I have a first addition: think about supporting encryption.
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15235438#comment-15235438 ] Ryan Blue commented on AVRO-1704: - Thanks for working on this, Niels. I'll make some comments later today or tomorrow.
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210535#comment-15210535 ] Niels Basjes commented on AVRO-1704: I did some experimenting over the last week and I posted my changed version of Avro here: https://github.com/nielsbasjes/avro/tree/AVRO-1704 What I did so far: # Added to Schema the getFingerPrint() method that uses CRC-64-AVRO to calculate the schema fingerprint. # Added a few SchemaStorage-related classes that allow storing schemas in memory. # Added to the generated classes the toBytes() method and the fromBytes static method. Both effectively call the 'real' implementations, which are in the SpecificRecordBase class. All of this passes all of the Java unit tests. At the application end my test code (using 3 slightly different variations of the same schema) looks like this. This works exactly as I expect it to. {code:java} SchemaFactory.put(com.bol.measure.v1.Measurement.getClassSchema()); SchemaFactory.put(com.bol.measure.v2.Measurement.getClassSchema()); SchemaFactory.put(com.bol.measure.v3.Measurement.getClassSchema()); com.bol.measure.v1.Measurement measurement = DummyMeasurementFactory.createTestMeasurement(timestamp); byte[] bytesV1 = measurement.toBytes(); com.bol.measure.v2.Measurement newBornV2 = com.bol.measure.v2.Measurement.fromBytes(bytesV1); com.bol.measure.v3.Measurement newBornV3 = com.bol.measure.v3.Measurement.fromBytes(bytesV1); {code} Things currently missing: documentation, extra tests, etc. I could really use some feedback on the structure of my change, and advice on how to approach the need to call a close() method on the schema storage part. Thanks.
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15190866#comment-15190866 ] Niels Basjes commented on AVRO-1704: Thanks for pointing this out. My updated proposal for this: {code}"Avro" <version> <fingerprint> <record>{code} Where # "version" = 1 byte indicating the version (or "schema") of the rest of the bytes. If version == 0x00: # "Fingerprint" = the CRC-64-AVRO of the canonical form of the schema. # "Record" = the record serialized to bytes using the existing serialization system. I personally do not like these 'chopped' prefixes if there is no really good reason to chop them (like the length). Because the project's name is so short, in this proposal I'm sticking to the full name of the project as the prefix: "Avro" (i.e. these 4 bytes: 0x41, 0x76, 0x72, 0x6F).
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189594#comment-15189594 ] Doug Cutting commented on AVRO-1704: bq. remove the things that do not impact the binary form of the record This is already done as part of fingerprint calculation. https://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas Also, if we opt for a prefix, we might use something like 'A'+'v'+'r'+0, where the last character also indicates the format version, including schema hash function. That's similar to what's used to label the file format, and has a side benefit of clearly demonstrating that this is binary, non-textual data.
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189473#comment-15189473 ] Niels Basjes commented on AVRO-1704: Note that having the "AVRO" prefix will also limit the number of needless calls to the Schema registry when bad records are put into the stream (like the Timer ticks example).
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189402#comment-15189402 ] Niels Basjes commented on AVRO-1704: I've been looking into what kind of solution would work here since I'm working on a project where we need data structures going into Kafka to be available to multiple consumers. The fundamental problem we need to solve is that of "Schema Evolution" in a streaming environment (let's assume Kafka with the built-in persistence of records). We need three things to make this happen: # A way to recognize that a 'blob' is a serialized AVRO record. #* We could simply assume it is always an AVRO record. #* I think we should simply let such a record start with "AVRO" to ensure we can cleanly catch problems like STORM-512 (summary: timer ticks were written into Kafka, which caused a lot of deserialization errors when reading the AVRO records). # A way to determine the schema this was written with. #* As indicated above I vote for using the CRC-64-AVRO. #** I noticed that a simple typo fix in the documentation of a schema causes a new fingerprint to be generated. #** Proposal: I think we should 'clean' the schema before calculating the fingerprint, i.e. remove the things that do not impact the binary form of the record (like the doc field). # A place where we can find the schemas using the fingerprint as the key. #* Here I think (looking at AVRO-1124 and the fact that there are ready-to-run implementations like this [Schema Registry|http://docs.confluent.io/current/schema-registry/docs/index.html]) we should limit what we keep inside Avro to something like a "SchemaFactory" interface (as the storage/retrieval interface to get a Schema) and a very basic implementation that simply reads the available schemas from a (set of) property file(s). Using this, others can write additional implementations that can read/write to things like databases or the above-mentioned Schema Registry.
So to summarize: my proposal for the standard {{Single record serialization format}} can be written as: {code}"AVRO" <fingerprint> <record>{code} [~rdblue], I'm seeking feedback from you guys on this proposal.
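The CRC-64-AVRO (Rabin) fingerprint referenced throughout this thread is computed over the UTF-8 bytes of the schema's Parsing Canonical Form with the table-driven routine given in the Avro specification; a Java transcription:

```java
public class Crc64Avro {
    // Seed/polynomial value from the Avro specification.
    static final long EMPTY = 0xc15d213aa4d7a795L;
    static final long[] FP_TABLE = new long[256];
    static {
        for (int i = 0; i < 256; i++) {
            long fp = i;
            for (int j = 0; j < 8; j++)
                fp = (fp >>> 1) ^ (EMPTY & -(fp & 1L));
            FP_TABLE[i] = fp;
        }
    }

    // Fingerprint of the UTF-8 bytes of the schema's Parsing Canonical Form.
    static long fingerprint64(byte[] buf) {
        long fp = EMPTY;
        for (byte b : buf)
            fp = (fp >>> 8) ^ FP_TABLE[(int) (fp ^ b) & 0xff];
        return fp;
    }
}
```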
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15133238#comment-15133238 ] ASF GitHub Bot commented on AVRO-1704: -- Github user asfgit closed the pull request at: https://github.com/apache/avro/pull/43
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14906640#comment-14906640 ]

Ryan Blue commented on AVRO-1704:
--

[~dasch], I think the most common one is CRC-64-AVRO. That's exactly why we need to standardize this though. I think we should go with just one and it would be good to have confirmation from the Kafka and Flume communities on which one they currently use.
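The CRC-64-AVRO fingerprint mentioned above is the 64-bit Rabin fingerprint defined in the Avro specification; a sketch transcribed from memory of the spec's reference algorithm (worth checking against the spec before relying on it):

```python
# 64-bit Rabin fingerprint ("CRC-64-AVRO") per the Avro specification's
# schema-fingerprint section.

EMPTY = 0xC15D213AA4D7A795

# Byte-wise lookup table, built as in the spec's reference code.
_TABLE = []
for i in range(256):
    fp = i
    for _ in range(8):
        fp = (fp >> 1) ^ (EMPTY & -(fp & 1))
    _TABLE.append(fp)


def crc64_avro(data: bytes) -> int:
    """Fingerprint of `data` (normally a schema's canonical form as UTF-8)."""
    fp = EMPTY
    for byte in data:
        fp = (fp >> 8) ^ _TABLE[(fp ^ byte) & 0xFF]
    return fp
```

By construction the fingerprint of the empty input is the `EMPTY` seed itself, and every result fits in 64 bits, which is what makes the fixed 8-byte header practical.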
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900357#comment-14900357 ]

Daniel Schierbeck commented on AVRO-1704:
--

[~rdblue] If there's already widespread usage of `` then I can simply implement that.
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740476#comment-14740476 ]

Daniel Schierbeck commented on AVRO-1704:
--

I think it's fine to standardize on a single fingerprint type. As for the metadata map, I was thinking that it would be nice for generic tools to use, e.g. keeping track of Kafka offsets and partitions when moving encoded data around. It's not a requirement, though, so if it's easier to get traction without it I wouldn't mind.
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740477#comment-14740477 ]

Daniel Schierbeck commented on AVRO-1704:
--

If we can agree on a format I can do the Ruby implementation.
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14739165#comment-14739165 ]

Ryan Blue commented on AVRO-1704:
--

I think this is a good idea. Quite a few people are doing this already, but with ad-hoc formats. [~granthenke] and [~gwenshap] are probably interested in this topic as well.

I think the one that is the most widely used is simply the 8-byte schema fingerprint from Java (SHA256?) followed by the encoded bytes. For compatibility with existing data in Kafka, I'd recommend going with that unless we have good reason to change it. I think it's better to specify the fingerprint ahead of time so we don't waste space encoding which one (or requiring more complicated code).

That leaves the format version number and metadata map, keeping in mind that if we decide we need either one then we are breaking compatibility with existing data and tools -- that's not too bad, but we should be aware of it. I like the idea of a format version number, but it might be unnecessary. I'm interested to hear what you envision the key/value metadata would be used for, too.
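The de-facto framing described in this comment (an 8-byte schema fingerprint followed directly by the encoded bytes) can be sketched as follows. The function names and the big-endian byte order are assumptions for illustration; byte order is exactly the kind of detail a spec would need to pin down.

```python
import struct

# Sketch of the ad-hoc framing: 8-byte schema fingerprint, then the
# Avro-binary-encoded datum. Big-endian is an assumption here.

def frame(fingerprint: int, payload: bytes) -> bytes:
    return struct.pack(">Q", fingerprint) + payload


def unframe(message: bytes):
    (fingerprint,) = struct.unpack_from(">Q", message, 0)
    return fingerprint, message[8:]
```

Note what this layout gives up relative to the five-part proposal: with no version marker or type identifier, readers and writers must agree out of band on both the fingerprint algorithm and the byte order.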
[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629488#comment-14629488 ]

ASF GitHub Bot commented on AVRO-1704:
--

GitHub user dasch opened a pull request: https://github.com/apache/avro/pull/43

AVRO-1704: Standardized format for encoding messages with Avro

This is a proof of concept implementation of [AVRO-1704](https://issues.apache.org/jira/browse/AVRO-1704).

- The fingerprint implementation is mocked out.
- Only 64-bit fingerprints are supported.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dasch/avro dasch/message-format

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/avro/pull/43.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #43

commit 5765e59879e2c70ec2095dd666105d26e0d592fc
Author: Daniel Schierbeck da...@zendesk.com
Date: 2015-07-16T09:05:38Z

    Add the Avro::Message format

commit f1286548ebf0e2b8ef50d604251fcfbd70137b8b
Author: Daniel Schierbeck da...@zendesk.com
Date: 2015-07-16T09:28:03Z

    Add SchemaStore

    Currently it's using a mock fingerprint implementation and only stores 64-bit fingerprints.
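For reference, the single-object encoding that this issue eventually standardized resolves the version-number question with a fixed two-byte marker rather than a general version field. To the best of my reading of the final spec (worth verifying against the Avro specification itself), the layout is a C3 01 marker, the 8-byte little-endian CRC-64-AVRO fingerprint of the writer's schema, then the binary-encoded datum. A minimal sketch:

```python
import struct

# Sketch of Avro's single-object encoding as eventually specified (layout
# details here are from memory of the spec and should be double-checked):
# two-byte marker, 8-byte little-endian fingerprint, encoded datum.

MAGIC = b"\xc3\x01"


def encode_single_object(fingerprint: int, payload: bytes) -> bytes:
    return MAGIC + struct.pack("<Q", fingerprint) + payload


def decode_single_object(message: bytes):
    if message[:2] != MAGIC:
        raise ValueError("not a single-object encoded Avro message")
    (fingerprint,) = struct.unpack_from("<Q", message, 2)
    return fingerprint, message[10:]
```

The marker doubles as the format version: if the layout ever needs to change, a different two-byte prefix can distinguish old from new messages without any extra header bytes.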