[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Busbey updated AVRO-1704:
--
Fix Version/s: (was: 1.8.3) 1.8.2

> Standardized format for encoding messages with Avro
> ---
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
> Issue Type: Improvement
> Components: java, spec
> Reporter: Daniel Schierbeck
> Assignee: Niels Basjes
> Fix For: 1.9.0, 1.8.2
>
> Attachments: AVRO-1704-20160410.patch,
> AVRO-1704-2016-05-03-Unfinished.patch, AVRO-1704.3.patch, AVRO-1704.4.patch
>
>
> I'm currently using the Datafile format for encoding messages that are
> written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync
> markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized,
> meaning that I can read and write data with minimal effort across the various
> languages in use in my organization. If there was a standardized format for
> encoding single values that was optimized for out-of-band schema transfer, I
> would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type.)
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode
> datums in this format, as well as a MessageReader that, given a SchemaStore,
> would be able to decode datums. The reader would decode the fingerprint and
> ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed
> library users to inject custom backends. A simple, file system based one
> could be provided out of the box.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
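The five pieces listed in the description, together with the proposed MessageWriter, MessageReader and SchemaStore roles, can be sketched roughly as follows. This is only an illustration of the proposal, not the format that was committed: the numeric fingerprint type identifiers, the length-prefixed metadata layout, the method names (`write`, `read`, `find`, `add`) and the in-memory store are all assumptions made for the example, and the fingerprint is taken over the raw schema text rather than any canonical form. The datum is treated as already-encoded bytes.

```python
import hashlib
import struct
from abc import ABC, abstractmethod

FORMAT_VERSION = 0
# Hypothetical type identifiers -> (hashlib name, digest size in bytes).
# The issue names Rabin, MD5 and SHA256 but assigns no numbers; Rabin is omitted here.
FP_TYPES = {1: ("md5", 16), 2: ("sha256", 32)}


def fingerprint(schema_json: str, fp_type: int) -> bytes:
    """Hash the raw schema text (assumption: no canonicalization)."""
    algo, _size = FP_TYPES[fp_type]
    return hashlib.new(algo, schema_json.encode("utf-8")).digest()


class SchemaStore(ABC):
    """Abstract lookup from fingerprint to writer's schema."""

    @abstractmethod
    def find(self, fp_type: int, fp: bytes) -> str:
        ...


class InMemorySchemaStore(SchemaStore):
    """Dictionary-backed store standing in for a custom backend."""

    def __init__(self):
        self._schemas = {}

    def add(self, schema_json: str, fp_type: int = 2) -> None:
        self._schemas[(fp_type, fingerprint(schema_json, fp_type))] = schema_json

    def find(self, fp_type: int, fp: bytes) -> str:
        return self._schemas[(fp_type, fp)]


def _prefixed(data: bytes) -> bytes:
    """Two-byte little-endian length followed by the bytes (assumed layout)."""
    return struct.pack("<H", len(data)) + data


class MessageWriter:
    def __init__(self, schema_json: str, fp_type: int = 2):
        self.fp_type = fp_type
        self.fp = fingerprint(schema_json, fp_type)

    def write(self, datum: bytes, metadata=None) -> bytes:
        metadata = metadata or {}
        # 1. version, 2. fingerprint type, 3. fingerprint
        out = bytes([FORMAT_VERSION, self.fp_type]) + self.fp
        # 4. optional metadata map: entry count, then key/value pairs
        out += struct.pack("<H", len(metadata))
        for key, value in metadata.items():
            out += _prefixed(key.encode("utf-8")) + _prefixed(value)
        # 5. the encoded datum
        return out + datum


class MessageReader:
    def __init__(self, store: SchemaStore):
        self.store = store

    def read(self, message: bytes):
        version, fp_type = message[0], message[1]
        if version != FORMAT_VERSION:
            raise ValueError("unsupported format version %d" % version)
        pos = 2
        size = FP_TYPES[fp_type][1]
        fp = message[pos:pos + size]
        pos += size
        (count,) = struct.unpack_from("<H", message, pos)
        pos += 2
        metadata = {}
        for _ in range(count):
            pair = []
            for _ in range(2):
                (n,) = struct.unpack_from("<H", message, pos)
                pos += 2
                pair.append(message[pos:pos + n])
                pos += n
            metadata[pair[0].decode("utf-8")] = pair[1]
        # Ask the store for the writer's schema; the caller decodes the datum with it.
        return self.store.find(fp_type, fp), metadata, message[pos:]
```

A reader built this way never sees the schema inline; it only needs a store that was populated out of band, which is the saving the description is after.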
[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Blue updated AVRO-1704:
Resolution: Fixed
Status: Resolved (was: Patch Available)

I committed the last patch with the spec changes, which closes out this issue. Thanks [~nielsbasjes], [~cutting], [~busbey], and [~dasch] for making this happen!

> Standardized format for encoding messages with Avro
> ---
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
> Issue Type: Improvement
> Components: java, spec
> Reporter: Daniel Schierbeck
> Assignee: Niels Basjes
> Fix For: 1.9.0, 1.8.3
>
> Attachments: AVRO-1704-2016-05-03-Unfinished.patch,
> AVRO-1704-20160410.patch, AVRO-1704.3.patch, AVRO-1704.4.patch
[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Busbey updated AVRO-1704:
--
Component/s: spec java

> Standardized format for encoding messages with Avro
> ---
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
> Issue Type: Improvement
> Components: java, spec
> Reporter: Daniel Schierbeck
> Assignee: Niels Basjes
> Fix For: 1.9.0, 1.8.3
>
> Attachments: AVRO-1704-2016-05-03-Unfinished.patch,
> AVRO-1704-20160410.patch, AVRO-1704.3.patch, AVRO-1704.4.patch
[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Blue updated AVRO-1704:
Attachment: AVRO-1704.4.patch

> Standardized format for encoding messages with Avro
> ---
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
> Issue Type: Improvement
> Reporter: Daniel Schierbeck
> Assignee: Niels Basjes
> Fix For: 1.9.0, 1.8.3
>
> Attachments: AVRO-1704-2016-05-03-Unfinished.patch,
> AVRO-1704-20160410.patch, AVRO-1704.3.patch, AVRO-1704.4.patch
[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Blue updated AVRO-1704:
Status: Patch Available (was: Open)

> Standardized format for encoding messages with Avro
> ---
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
> Issue Type: Improvement
> Reporter: Daniel Schierbeck
> Assignee: Niels Basjes
> Fix For: 1.9.0, 1.8.3
>
> Attachments: AVRO-1704-2016-05-03-Unfinished.patch,
> AVRO-1704-20160410.patch, AVRO-1704.3.patch
[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Busbey updated AVRO-1704:
--
Fix Version/s: (was: 1.7.8)

> Standardized format for encoding messages with Avro
> ---
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
> Issue Type: Improvement
> Reporter: Daniel Schierbeck
> Assignee: Niels Basjes
> Fix For: 1.9.0, 1.8.3
>
> Attachments: AVRO-1704-2016-05-03-Unfinished.patch,
> AVRO-1704-20160410.patch
[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Busbey updated AVRO-1704:
--
Fix Version/s: 1.7.8

> Standardized format for encoding messages with Avro
> ---
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
> Issue Type: Improvement
> Reporter: Daniel Schierbeck
> Assignee: Niels Basjes
> Fix For: 1.9.0, 1.8.3
>
> Attachments: AVRO-1704-2016-05-03-Unfinished.patch,
> AVRO-1704-20160410.patch
[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Busbey updated AVRO-1704:
--
Fix Version/s: 1.8.3

> Standardized format for encoding messages with Avro
> ---
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
> Issue Type: Improvement
> Reporter: Daniel Schierbeck
> Assignee: Niels Basjes
> Fix For: 1.9.0, 1.8.3
>
> Attachments: AVRO-1704-2016-05-03-Unfinished.patch,
> AVRO-1704-20160410.patch
[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Busbey updated AVRO-1704:
--
Fix Version/s: 1.9.0

> Standardized format for encoding messages with Avro
> ---
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
> Issue Type: Improvement
> Reporter: Daniel Schierbeck
> Assignee: Niels Basjes
> Fix For: 1.9.0
>
> Attachments: AVRO-1704-2016-05-03-Unfinished.patch,
> AVRO-1704-20160410.patch
[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niels Basjes updated AVRO-1704:
---
Status: Open (was: Patch Available)

> Standardized format for encoding messages with Avro
> ---
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
> Issue Type: Improvement
> Reporter: Daniel Schierbeck
> Assignee: Niels Basjes
> Attachments: AVRO-1704-2016-05-03-Unfinished.patch,
> AVRO-1704-20160410.patch
[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niels Basjes updated AVRO-1704:
---
Status: Patch Available (was: Open)

Although this patch is not yet finished I would really like review comments at this point from other committers (like [~rdblue] and if possible [~cutting]).

Thanks.

> Standardized format for encoding messages with Avro
> ---
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
> Issue Type: Improvement
> Reporter: Daniel Schierbeck
> Assignee: Niels Basjes
> Attachments: AVRO-1704-2016-05-03-Unfinished.patch,
> AVRO-1704-20160410.patch
[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niels Basjes updated AVRO-1704:
---
Attachment: AVRO-1704-2016-05-03-Unfinished.patch

This patch does just about everything we talked about.
Both the schema storage and the serialization of the body are pluggable.
I created a single record serializer that does an 'xor' obfuscation on the binary. I see this as enough proof that someone else can later create a proper encryption layer.

Main things that still need to be done:
# What do we call this? "Single record serializer"?
# The currently generated methods are toBytes and fromBytes. Do we keep those names or should they be more explicit, like toSingleRecordBytes or toBytesWithSchema or ...
# Check the format of the fingerprint in the byte[] (should be little endian) and see if there is an existing method that does this in a performant way (suggestions are welcome).
# Naming of packages and classes. I find some of the current names I came up with "sub-optimal".

Please review and provide input to the points above.

Thanks.

> Standardized format for encoding messages with Avro
> ---
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
> Issue Type: Improvement
> Reporter: Daniel Schierbeck
> Assignee: Niels Basjes
> Attachments: AVRO-1704-2016-05-03-Unfinished.patch,
> AVRO-1704-20160410.patch
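Two points from the comment above lend themselves to a small illustration: writing the 64-bit fingerprint in little-endian order, and making the body serialization pluggable with an 'xor' obfuscation as the proof of concept. The sketch below is not taken from the attached patch. The names `fingerprint_to_bytes`, `XorBody`, `to_bytes` and `from_bytes` are hypothetical stand-ins, and the layout (fingerprint followed directly by the transformed body) is an assumption.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def fingerprint_to_bytes(fp: int) -> bytes:
    """Pack a 64-bit fingerprint as 8 little-endian bytes.

    Masking lets a negative value (as a signed Java long would be) pack
    to the same bytes as its unsigned two's-complement form.
    """
    return struct.pack("<Q", fp & MASK64)


def fingerprint_from_bytes(buf: bytes) -> int:
    """Inverse of fingerprint_to_bytes; returns the unsigned value."""
    return struct.unpack_from("<Q", buf, 0)[0]


class XorBody:
    """Pluggable body transform: xor every byte with a one-byte key.

    This is obfuscation only, standing in for a real encryption layer.
    """

    def __init__(self, key: int):
        self.key = key & 0xFF

    def encode(self, body: bytes) -> bytes:
        return bytes(b ^ self.key for b in body)

    # xor with the same key is its own inverse
    decode = encode


def to_bytes(fp: int, body: bytes, transform) -> bytes:
    """Fingerprint first, then the transformed body (assumed layout)."""
    return fingerprint_to_bytes(fp) + transform.encode(body)


def from_bytes(message: bytes, transform):
    """Split a message back into (fingerprint, original body)."""
    return fingerprint_from_bytes(message), transform.decode(message[8:])
```

Because the transform is passed in rather than hard-coded, swapping `XorBody` for an object with the same `encode`/`decode` pair is the only change a different body serialization would need in this sketch.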
[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niels Basjes updated AVRO-1704:
---
Status: Open (was: Patch Available)

> Standardized format for encoding messages with Avro
> ---
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
> Issue Type: Improvement
> Reporter: Daniel Schierbeck
> Assignee: Niels Basjes
> Attachments: AVRO-1704-20160410.patch
[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niels Basjes updated AVRO-1704:
---
Status: Patch Available (was: Open)

During the last few weeks I spent some time figuring out what I think the format should be.
I created this patch, which includes a specification for the new format, code generators for Java, and unit tests that validate the format in light of schema evolution and corrupt data.

I documented the new format as follows:
{quote}
Schema tagged Binary Encoding specification

The wrapper format consists of a header and a body.
The header is always the 4 bytes representing the UTF-8 encoding of the word "Avro", followed by a single byte indicating the version of the body format.
Version 0 of the body (currently the ONLY body format that has been defined) consists of:
# the fingerprint of the schema (see the section about Schema Fingerprints), a 64-bit long written in the same byte order as a long that is written as a field in a record.
# the record serialized to bytes using the binary encoding.
{quote}

Although I think this is already "pretty good", I really think it needs your comments and improvement suggestions.

Thanks.

> Standardized format for encoding messages with Avro
> ---
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
> Issue Type: Improvement
> Reporter: Daniel Schierbeck
> Assignee: Niels Basjes
> Attachments: AVRO-1704-20160410.patch
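The draft wrapper format quoted above (a 4-byte "Avro" header, one version byte, a 64-bit schema fingerprint, then the binary-encoded record) can be sketched as below. It follows the draft in this comment, not necessarily what was finally committed. Two assumptions are made loudly: the fingerprint is written as a fixed 8-byte little-endian value (the draft's wording on byte order is ambiguous), and the schema is passed in as bytes that are already in whatever canonical form the fingerprint should cover. The 64-bit Rabin fingerprint is the CRC-64-AVRO algorithm from the Schema Fingerprints section of the Avro specification; `encode` and `decode` are hypothetical names.

```python
import struct

MAGIC = b"Avro"          # UTF-8 bytes of the word "Avro"
VERSION = 0              # the only body version the draft defines
HEADER_LEN = len(MAGIC) + 1 + 8

EMPTY = 0xC15D213AA4D7A795  # CRC-64-AVRO seed from the Avro specification


def _build_table():
    """Precompute the 256-entry table used by the Rabin fingerprint."""
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            fp = (fp >> 1) ^ (EMPTY if fp & 1 else 0)
        table.append(fp)
    return table


FP_TABLE = _build_table()


def rabin_fingerprint(buf: bytes) -> int:
    """64-bit Rabin fingerprint (CRC-64-AVRO) of the given bytes."""
    fp = EMPTY
    for b in buf:
        fp = (fp >> 8) ^ FP_TABLE[(fp ^ b) & 0xFF]
    return fp


def encode(schema_bytes: bytes, record_bytes: bytes) -> bytes:
    """Header, version byte, little-endian fingerprint, then the body."""
    fp = rabin_fingerprint(schema_bytes)
    return MAGIC + bytes([VERSION]) + struct.pack("<Q", fp) + record_bytes


def decode(message: bytes):
    """Validate the header and return (fingerprint, record_bytes).

    Raises ValueError on truncated or corrupt input, the case the
    unit tests mentioned in the comment are meant to cover.
    """
    if len(message) < HEADER_LEN:
        raise ValueError("message shorter than header")
    if message[:4] != MAGIC:
        raise ValueError("missing 'Avro' header")
    if message[4] != VERSION:
        raise ValueError("unknown body version %d" % message[4])
    (fp,) = struct.unpack_from("<Q", message, 5)
    return fp, message[HEADER_LEN:]
```

The caller would use the returned fingerprint to look up the writer's schema and then decode the record bytes against it; that lookup is outside this sketch.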
[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niels Basjes updated AVRO-1704:
---
Attachment: AVRO-1704-20160410.patch

> Standardized format for encoding messages with Avro
> ---
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
> Issue Type: Improvement
> Reporter: Daniel Schierbeck
> Assignee: Niels Basjes
> Attachments: AVRO-1704-20160410.patch
[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro
[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Schierbeck updated AVRO-1704:
    Description:
I'm currently using the Datafile format for encoding messages that are written to Kafka and Cassandra. This seems rather wasteful:

1. I only encode a single record at a time, so there's no need for sync markers and other metadata related to multi-record files.
2. The entire schema is inlined every time.

However, the Datafile format is the only one that has been standardized, meaning that I can read and write data with minimal effort across the various languages in use in my organization. If there was a standardized format for encoding single values that was optimized for out-of-band schema transfer, I would much rather use that.

I think the necessary pieces of the format would be:

1. A format version number.
2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
3. The actual schema fingerprint (according to the type.)
4. Optional metadata map.
5. The encoded datum.

The language libraries would implement a MessageWriter that would encode datums in this format, as well as a MessageReader that, given a SchemaStore, would be able to decode datums. The reader would decode the fingerprint and ask its SchemaStore to return the corresponding writer's schema.

The idea is that SchemaStore would be an abstract interface that allowed library users to inject custom backends. A simple, file system based one could be provided out of the box.

  was:
I'm currently using the Datafile format for encoding messages that are written to Kafka and Cassandra. This seems rather wasteful:

1. I only encode a single record at a time, so there's no need for sync markers and other metadata related to multi-record files.
2. The entire schema is inlined every time.

However, the Datafile format is the only one that has been standardized, meaning that I can read and write data with minimal effort across the various languages in use in my organization. If there was a standardized format for encoding single values that was optimized for out-of-band schema transfer, I would much rather use that.

I think the necessary pieces of the format could be:

1. A format version number.
2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
3. The actual schema fingerprint (according to the type.)
4. Optional metadata map.
5. The encoded datum.

The language libraries would implement a MessageWriter that would encode datums in this format, as well as a MessageReader that, given a SchemaStore, would be able to decode datums. The reader would decode the fingerprint and ask its SchemaStore to return the corresponding writer's schema.

The idea is that SchemaStore would be an abstract interface that allowed library users to inject custom backends. A simple, file system based one could be provided out of the box.


Standardized format for encoding messages with Avro
---

                Key: AVRO-1704
                URL: https://issues.apache.org/jira/browse/AVRO-1704
            Project: Avro
         Issue Type: Improvement
           Reporter: Daniel Schierbeck

I'm currently using the Datafile format for encoding messages that are written to Kafka and Cassandra. This seems rather wasteful:

1. I only encode a single record at a time, so there's no need for sync markers and other metadata related to multi-record files.
2. The entire schema is inlined every time.

However, the Datafile format is the only one that has been standardized, meaning that I can read and write data with minimal effort across the various languages in use in my organization. If there was a standardized format for encoding single values that was optimized for out-of-band schema transfer, I would much rather use that.

I think the necessary pieces of the format would be:

1. A format version number.
2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
3. The actual schema fingerprint (according to the type.)
4. Optional metadata map.
5. The encoded datum.

The language libraries would implement a MessageWriter that would encode datums in this format, as well as a MessageReader that, given a SchemaStore, would be able to decode datums. The reader would decode the fingerprint and ask its SchemaStore to return the corresponding writer's schema.

The idea is that SchemaStore would be an abstract interface that allowed library users to inject custom backends. A simple, file system based one could be provided out of the box.
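The "simple, file system based" SchemaStore mentioned at the end of the description might look something like the sketch below. Everything here is an assumption made for illustration: the class name `FileSystemSchemaStore`, the one-file-per-schema layout, the `.avsc` file naming by hex fingerprint, and the use of SHA-256 over the raw schema text.

```python
import hashlib
import tempfile
from pathlib import Path


class FileSystemSchemaStore:
    """Hypothetical store keeping one schema file per fingerprint.

    Assumed layout: <root>/<hex fingerprint>.avsc
    """

    def __init__(self, root) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, fp: bytes) -> Path:
        return self.root / (fp.hex() + ".avsc")

    def add(self, schema_json: str) -> bytes:
        """Write the schema to disk and return its fingerprint."""
        # Assumption: SHA-256 over the raw schema text.
        fp = hashlib.sha256(schema_json.encode("utf-8")).digest()
        self._path(fp).write_text(schema_json, encoding="utf-8")
        return fp

    def find_by_fingerprint(self, fp: bytes) -> str:
        """Return the schema text for a fingerprint, or raise KeyError."""
        path = self._path(fp)
        if not path.is_file():
            raise KeyError(f"no schema for fingerprint {fp.hex()}")
        return path.read_text(encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = FileSystemSchemaStore(tmp)
        fp = store.add('{"type": "long"}')
        print(store.find_by_fingerprint(fp))
```

Because lookup depends only on the fingerprint, the same directory could be shared by writers and readers in different languages, as long as they agree on how the fingerprint is computed.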