[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro

2017-06-01 - Sean Busbey (JIRA)


Sean Busbey updated AVRO-1704:
--
Fix Version/s: 1.8.2 (was: 1.8.3)

> Standardized format for encoding messages with Avro
> ---
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
>  Issue Type: Improvement
>  Components: java, spec
>Reporter: Daniel Schierbeck
>Assignee: Niels Basjes
> Fix For: 1.9.0, 1.8.2
>
> Attachments: AVRO-1704-20160410.patch, 
> AVRO-1704-2016-05-03-Unfinished.patch, AVRO-1704.3.patch, AVRO-1704.4.patch
>
>
> I'm currently using the Datafile format for encoding messages that are 
> written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync 
> markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, 
> meaning that I can read and write data with minimal effort across the various 
> languages in use in my organization. If there was a standardized format for 
> encoding single values that was optimized for out-of-band schema transfer, I 
> would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, e.g. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type).
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode 
> datums in this format, as well as a MessageReader that, given a SchemaStore, 
> would be able to decode datums. The reader would decode the fingerprint and 
> ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed 
> library users to inject custom backends. A simple, file system based one 
> could be provided out of the box.
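
To make these five pieces concrete, here is a minimal Java sketch of a writer for the proposed layout. Every name and byte width below is an illustrative assumption (a one-byte version, a one-byte fingerprint type, a 64-bit little-endian fingerprint); nothing in this issue fixes them, and the optional metadata map is omitted.

{code:java}
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical store that maps a fingerprint back to the writer's schema JSON. */
interface SchemaStore {
  String findByFingerprint(byte fingerprintType, byte[] fingerprint);
}

final class MessageWriter {
  private static final byte FORMAT_VERSION = 0; // piece 1: format version
  private static final byte FP_TYPE_RABIN = 1;  // piece 2: fingerprint type id

  /** Frames an already binary-encoded datum with the proposed header. */
  static byte[] encode(long rabinFingerprint, byte[] encodedDatum) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(FORMAT_VERSION);
    out.write(FP_TYPE_RABIN);
    byte[] fp = ByteBuffer.allocate(8)
        .order(ByteOrder.LITTLE_ENDIAN)        // byte order assumed here
        .putLong(rabinFingerprint)
        .array();
    out.write(fp, 0, fp.length);               // piece 3: the fingerprint
    // piece 4, the optional metadata map, is omitted in this sketch
    out.write(encodedDatum, 0, encodedDatum.length); // piece 5: the datum
    return out.toByteArray();
  }
}
{code}

A matching MessageReader would read the version and type bytes, pass the fingerprint to its SchemaStore, and decode the remaining bytes against the returned writer's schema.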





[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro

2016-09-04 - Ryan Blue (JIRA)


Ryan Blue updated AVRO-1704:

Resolution: Fixed
Status: Resolved  (was: Patch Available)

I committed the last patch with the spec changes, which closes out this issue. 
Thanks [~nielsbasjes], [~cutting], [~busbey], and [~dasch] for making this 
happen!






[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro

2016-07-26 - Sean Busbey (JIRA)


Sean Busbey updated AVRO-1704:
--
Component/s: java, spec






[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro

2016-07-24 - Ryan Blue (JIRA)


Ryan Blue updated AVRO-1704:

Attachment: AVRO-1704.4.patch






[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro

2016-07-24 - Ryan Blue (JIRA)


Ryan Blue updated AVRO-1704:

Status: Patch Available  (was: Open)






[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro

2016-07-05 - Sean Busbey (JIRA)


Sean Busbey updated AVRO-1704:
--
Fix Version/s: (was: 1.7.8)






[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro

2016-07-05 - Sean Busbey (JIRA)


Sean Busbey updated AVRO-1704:
--
Fix Version/s: 1.7.8






[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro

2016-07-05 - Sean Busbey (JIRA)


Sean Busbey updated AVRO-1704:
--
Fix Version/s: 1.8.3






[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro

2016-07-05 - Sean Busbey (JIRA)


Sean Busbey updated AVRO-1704:
--
Fix Version/s: 1.9.0






[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro

2016-05-05 - Niels Basjes (JIRA)


Niels Basjes updated AVRO-1704:
---
Status: Open  (was: Patch Available)






[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro

2016-05-03 - Niels Basjes (JIRA)


Niels Basjes updated AVRO-1704:
---
Status: Patch Available  (was: Open)

Although this patch is not yet finished, I would really like review comments at 
this point from other committers (like [~rdblue] and, if possible, [~cutting]).
Thanks.






[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro

2016-05-03 - Niels Basjes (JIRA)


Niels Basjes updated AVRO-1704:
---
Attachment: AVRO-1704-2016-05-03-Unfinished.patch

This patch does just about everything we talked about.
Both the schema storage and the serialization of the body are pluggable. 

I created a single record serializer that does an 'xor' obfuscation on the 
binary. I see this as enough proof that someone else can later create a proper 
encryption layer.
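
As a sketch of what such a pluggable transform could look like (the interface and class names here are invented for illustration, not taken from the patch):

{code:java}
/** A pluggable byte-level transform applied to the serialized body. */
interface BytesTransform {
  byte[] apply(byte[] input);
}

/** Toy obfuscation standing in for a real encryption layer. */
final class XorTransform implements BytesTransform {
  private final byte key;

  XorTransform(byte key) {
    this.key = key;
  }

  @Override
  public byte[] apply(byte[] input) {
    byte[] out = new byte[input.length];
    for (int i = 0; i < input.length; i++) {
      out[i] = (byte) (input[i] ^ key); // xor is its own inverse, so the same
    }                                   // transform also reverses itself
    return out;
  }
}
{code}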

Main things that still need to be done:
# What do we call this? "Single record serializer"?
# The currently generated methods are toBytes and fromBytes. Do we keep those 
names, or should they be more explicit, like toSingleRecordBytes or 
toBytesWithSchema or ...?
# Check the format of the fingerprint in the byte[] (it should be little endian) 
and see if there is an existing method that does this in a performant way 
(suggestions are welcome; see the sketch after this list).
# Naming of packages and classes. I find some of the current names I came up 
with "sub-optimal".

Please review and provide input to the points above.
Thanks.
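
On point 3, two standard-library ways to lay a 64-bit fingerprint out little-endian in Java; the class and method names here are invented for illustration:

{code:java}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class FingerprintBytes {
  /** Little-endian via ByteBuffer: clear, and cheap once the JIT kicks in. */
  static byte[] viaByteBuffer(long fingerprint) {
    return ByteBuffer.allocate(8)
        .order(ByteOrder.LITTLE_ENDIAN)
        .putLong(fingerprint)
        .array();
  }

  /** Equivalent hand-rolled shift loop that avoids the buffer allocation. */
  static void intoArray(long fingerprint, byte[] dest, int offset) {
    for (int i = 0; i < 8; i++) {
      dest[offset + i] = (byte) (fingerprint >>> (8 * i));
    }
  }
}
{code}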






[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro

2016-04-28 - Niels Basjes (JIRA)


Niels Basjes updated AVRO-1704:
---
Status: Open  (was: Patch Available)






[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro

2016-04-10 - Niels Basjes (JIRA)


Niels Basjes updated AVRO-1704:
---
Status: Patch Available  (was: Open)

During the last few weeks I spent some time figuring out what I think the 
format should be. I created this patch, which includes a specification for the 
new format, code generation for Java, and unit tests that validate the format 
in light of schema evolution and corrupt data.

I documented the new format as follows:
{quote}
Schema tagged Binary Encoding specification

The wrapper format consists of a header and a body.
The header is always the 4 bytes of the UTF-8 encoding of the word "Avro", 
followed by a single byte indicating the version of the body format.

Version 0 of the body (currently the ONLY body format that has been defined) 
consists of:
# the fingerprint of the schema (a 64-bit long; see the section about Schema 
Fingerprints), written in the same byte order as a long would be if it were 
written as a field in a record.
# the record serialized to bytes using the binary encoding.
{quote}

Although I think this is already "pretty good", it really needs your comments 
and improvement suggestions.

Thanks.
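
Read literally, the draft above describes a 4-byte magic plus a 1-byte version, then an 8-byte fingerprint, then the binary-encoded record. Here is a minimal sketch of a writer for that layout; the little-endian fingerprint order is an assumption, since the quoted text ties the order to how a long field would be written and leaves it open:

{code:java}
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class SchemaTaggedBinary {
  private static final byte[] MAGIC = "Avro".getBytes(StandardCharsets.UTF_8);
  private static final byte BODY_VERSION = 0;

  /** Wraps a binary-encoded record in the draft header described above. */
  static byte[] wrap(long schemaFingerprint, byte[] binaryEncodedRecord) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(MAGIC, 0, MAGIC.length);   // 4 bytes: UTF-8 of the word "Avro"
    out.write(BODY_VERSION);             // 1 byte: version of the body format
    for (int i = 0; i < 8; i++) {        // 8 bytes: the schema fingerprint,
      out.write((byte) (schemaFingerprint >>> (8 * i))); // little-endian (assumed)
    }
    out.write(binaryEncodedRecord, 0, binaryEncodedRecord.length);
    return out.toByteArray();
  }
}
{code}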






[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro

2016-04-10 - Niels Basjes (JIRA)


Niels Basjes updated AVRO-1704:
---
Attachment: AVRO-1704-20160410.patch






[jira] [Updated] (AVRO-1704) Standardized format for encoding messages with Avro

2015-07-16 - Daniel Schierbeck (JIRA)


Daniel Schierbeck updated AVRO-1704:

Description: 
I'm currently using the Datafile format for encoding messages that are written 
to Kafka and Cassandra. This seems rather wasteful:

1. I only encode a single record at a time, so there's no need for sync markers 
and other metadata related to multi-record files.
2. The entire schema is inlined every time.

However, the Datafile format is the only one that has been standardized, 
meaning that I can read and write data with minimal effort across the various 
languages in use in my organization. If there was a standardized format for 
encoding single values that was optimized for out-of-band schema transfer, I 
would much rather use that.

I think the necessary pieces of the format would be:

1. A format version number.
2. A schema fingerprint type identifier, e.g. Rabin, MD5, SHA256, etc.
3. The actual schema fingerprint (according to the type).
4. Optional metadata map.
5. The encoded datum.

The language libraries would implement a MessageWriter that would encode datums 
in this format, as well as a MessageReader that, given a SchemaStore, would be 
able to decode datums. The reader would decode the fingerprint and ask its 
SchemaStore to return the corresponding writer's schema.

The idea is that SchemaStore would be an abstract interface that allowed 
library users to inject custom backends. A simple, file system based one could 
be provided out of the box.

  was:
I'm currently using the Datafile format for encoding messages that are written 
to Kafka and Cassandra. This seems rather wasteful:

1. I only encode a single record at a time, so there's no need for sync markers 
and other metadata related to multi-record files.
2. The entire schema is inlined every time.

However, the Datafile format is the only one that has been standardized, 
meaning that I can read and write data with minimal effort across the various 
languages in use in my organization. If there was a standardized format for 
encoding single values that was optimized for out-of-band schema transfer, I 
would much rather use that.

I think the necessary pieces of the format could be:

1. A format version number.
2. A schema fingerprint type identifier, e.g. Rabin, MD5, SHA256, etc.
3. The actual schema fingerprint (according to the type).
4. Optional metadata map.
5. The encoded datum.

The language libraries would implement a MessageWriter that would encode datums 
in this format, as well as a MessageReader that, given a SchemaStore, would be 
able to decode datums. The reader would decode the fingerprint and ask its 
SchemaStore to return the corresponding writer's schema.

The idea is that SchemaStore would be an abstract interface that allowed 
library users to inject custom backends. A simple, file system based one could 
be provided out of the box.




