[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2017-06-01 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033078#comment-16033078
 ] 

Sean Busbey commented on AVRO-1704:
---

I think that's because the fix version wasn't properly set when it got closed 
out. I've updated it to be 1.8.2 now, so it should be in the release notes.

> Standardized format for encoding messages with Avro
> ---
>
> Key: AVRO-1704
> URL: https://issues.apache.org/jira/browse/AVRO-1704
> Project: Avro
>  Issue Type: Improvement
>  Components: java, spec
>Reporter: Daniel Schierbeck
>Assignee: Niels Basjes
> Fix For: 1.9.0, 1.8.2
>
> Attachments: AVRO-1704-20160410.patch, 
> AVRO-1704-2016-05-03-Unfinished.patch, AVRO-1704.3.patch, AVRO-1704.4.patch
>
>
> I'm currently using the Datafile format for encoding messages that are 
> written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync 
> markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, 
> meaning that I can read and write data with minimal effort across the various 
> languages in use in my organization. If there was a standardized format for 
> encoding single values that was optimized for out-of-band schema transfer, I 
> would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, e.g. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type).
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode 
> datums in this format, as well as a MessageReader that, given a SchemaStore, 
> would be able to decode datums. The reader would decode the fingerprint and 
> ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed 
> library users to inject custom backends. A simple, file system based one 
> could be provided out of the box.
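
For reference, the SchemaStore described in this proposal ended up in the Java
implementation as a single-method lookup interface. A minimal sketch of that
shape (the file-system-backed store suggested above remains hypothetical):

{code:lang=java}
import org.apache.avro.Schema;

// Shape of org.apache.avro.message.SchemaStore: resolve the writer's schema
// from the 64-bit fingerprint carried in each message.
public interface SchemaStore {
  /** Returns the schema for a fingerprint, or null if it is not known. */
  Schema findByFingerprint(long fingerprint);
}
{code}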



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2017-06-01 Thread Jacob Rideout (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033068#comment-16033068
 ] 

Jacob Rideout commented on AVRO-1704:
-

Hmmm... it looks like it is in branch-1.8. I am confused, since it is NOT 
listed in https://s.apache.org/avro-release-note-1.8.2



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2017-06-01 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033065#comment-16033065
 ] 

Sean Busbey commented on AVRO-1704:
---

Looks like it's in 1.8.2 to me: 
http://avro.apache.org/docs/1.8.2/spec.html#single_object_encoding



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2017-06-01 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033060#comment-16033060
 ] 

Sean Busbey commented on AVRO-1704:
---

The JIRA is resolved and it was listed as a blocker for 1.8.2. Is it not 
actually in branch-1.8?



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2017-06-01 Thread Jacob Rideout (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033034#comment-16033034
 ] 

Jacob Rideout commented on AVRO-1704:
-

What needs to be done to land this in 1.8.3?



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-12-01 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712516#comment-15712516
 ] 

Ryan Blue commented on AVRO-1704:
-

You mean erring on the side of caution and using a larger hash? I don't think 
collisions with a 64-bit fingerprint are likely enough to cause any trouble. 
And, while you don't calculate the fingerprint every time, you do send it in 
the message.
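
For concreteness, a minimal sketch of computing the fingerprints under 
discussion with Avro's SchemaNormalization; the schema literal is a made-up 
example:

{code:lang=java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class FingerprintDemo {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema, for illustration only.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Example\",\"fields\":"
            + "[{\"name\":\"id\",\"type\":\"long\"}]}");

    // The 64-bit CRC-64-AVRO (Rabin) fingerprint: the 8 bytes sent with
    // every message in the single-object encoding.
    long rabin = SchemaNormalization.parsingFingerprint64(schema);

    // A larger digest, if collision resistance were ever a concern.
    byte[] md5 = SchemaNormalization.parsingFingerprint("MD5", schema);

    System.out.printf("rabin=%016x, md5 is %d bytes%n", rabin, md5.length);
  }
}
{code}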



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-12-01 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712508#comment-15712508
 ] 

Ryan Blue commented on AVRO-1704:
-

With a spec like this, we want to be careful about having too many things that 
must be implemented. I think there would have to be a very good reason to add 
additional hashes to the spec.

If you're interested in using the Avro MessageEncoder and MessageDecoder, that 
shouldn't be too difficult: the code is modular enough that you can implement a 
decoder for your own message format fairly easily.
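
To illustrate, a rough, uncommitted sketch of a decoder for a legacy "8-byte 
fingerprint + binary payload" layout, built on MessageDecoder.BaseDecoder and 
RawMessageDecoder; the LegacyDecoder name and the single known writer schema 
are assumptions:

{code:lang=java}
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.message.MessageDecoder;
import org.apache.avro.message.RawMessageDecoder;

public class LegacyDecoder extends MessageDecoder.BaseDecoder<GenericRecord> {
  private final long expectedFingerprint;
  private final RawMessageDecoder<GenericRecord> payloadDecoder;

  public LegacyDecoder(Schema writeSchema, long expectedFingerprint) {
    this.expectedFingerprint = expectedFingerprint;
    // RawMessageDecoder reads the Avro binary payload, with no header.
    this.payloadDecoder = new RawMessageDecoder<>(GenericData.get(), writeSchema);
  }

  @Override
  public GenericRecord decode(InputStream stream, GenericRecord reuse)
      throws IOException {
    // The legacy layout prefixes each payload with an 8-byte schema fingerprint.
    long fingerprint = new DataInputStream(stream).readLong();
    if (fingerprint != expectedFingerprint) {
      throw new IOException("Unknown writer schema fingerprint: " + fingerprint);
    }
    return payloadDecoder.decode(stream, reuse);
  }
}
{code}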



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-10-13 Thread radai rosenblatt (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572449#comment-15572449
 ] 

radai rosenblatt commented on AVRO-1704:


Also, since this is somewhat Kafka-related, I would like to point to this Kafka 
proposal for headers in the Kafka wire format: 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-82+-+Add+Record+Headers
The discussion thread is here: 
http://mail-archives.apache.org/mod_mbox/kafka-dev/201609.mbox/%3C1474572662302.81658%40ig.com%3E



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-10-13 Thread radai rosenblatt (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572398#comment-15572398
 ] 

radai rosenblatt commented on AVRO-1704:


At LinkedIn we use a similar scheme for our Avro payloads over Kafka, but we 
use a 128-bit hash as the schema identifier.
Would it be possible to still make the hashing scheme changeable, to make the 
transition easier for organizations not using 64-bit schema IDs?



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-09-04 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15463478#comment-15463478
 ] 

Ryan Blue commented on AVRO-1704:
-

Thanks for reviewing!



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-09-04 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15463379#comment-15463379
 ] 

Sean Busbey commented on AVRO-1704:
---

+1 on AVRO-1704.4.patch



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-09-03 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15461949#comment-15461949
 ] 

Ryan Blue commented on AVRO-1704:
-

I'm marking this as a blocker for the 1.8.2 release because the code is 
committed. If we release the implementation, I think we should also include the 
spec changes.



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-09-03 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15461947#comment-15461947
 ] 

Ryan Blue commented on AVRO-1704:
-

[~busbey], could you have a look at the last patch I posted with the spec 
changes? I'd like to get it into 1.8.2, since the code already is. Thank you!



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-07-25 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391974#comment-15391974
 ] 

Sean Busbey commented on AVRO-1704:
---

FWIW, I belatedly agree with Doug's statement.

Do we have our compatibility promises documented somewhere? I feel like I have 
a good sense of them, but I don't know if that's just because I've been in the 
community for several years.



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-07-24 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391193#comment-15391193
 ] 

Ryan Blue commented on AVRO-1704:
-

I just committed the Java implementation, with additional Javadoc. This did not 
include the incompatible changes, which should be done in a separate issue.

I also took the spec from [~nielsbasjes]'s patch and updated it:
* Use "object" instead of "record" to be more clear that it doesn't have to be 
an Avro record
* Use C3 01 for the header
* Simplify the encoding spec as much as possible
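
For reference, a minimal sketch of a round trip through the committed Java 
classes; the schema and values are made up:

{code:lang=java}
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.avro.message.BinaryMessageDecoder;
import org.apache.avro.message.BinaryMessageEncoder;

public class SingleObjectDemo {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema, for illustration only.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Example\",\"fields\":"
            + "[{\"name\":\"id\",\"type\":\"long\"}]}");
    GenericRecord record = new GenericRecordBuilder(schema).set("id", 1L).build();

    BinaryMessageEncoder<GenericRecord> encoder =
        new BinaryMessageEncoder<>(GenericData.get(), schema);
    BinaryMessageDecoder<GenericRecord> decoder =
        new BinaryMessageDecoder<>(GenericData.get(), schema);

    // Wire layout: C3 01 | 8-byte CRC-64-AVRO fingerprint | binary datum.
    ByteBuffer message = encoder.encode(record);
    System.out.println(decoder.decode(message));
  }
}
{code}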



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-07-24 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391117#comment-15391117
 ] 

Ryan Blue commented on AVRO-1704:
-

Sounds good to me! I'll fix the missing Javadoc and remove that change.



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-07-24 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389787#comment-15389787
 ] 

Doug Cutting commented on AVRO-1704:


We don't promise source-compatibility for minor Avro releases, but do for dot 
releases.  So this should not go into 1.8.x but could go into 1.9.0.  (An 
incompatible change to data formats would require a 2.0.)



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-07-22 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15390333#comment-15390333
 ] 

Ryan Blue commented on AVRO-1704:
-

For the createDatumReader/Writer change: it is [binary compatible because of 
type 
erasure|https://docs.oracle.com/javase/specs/jls/se7/html/jls-13.html#jls-13.4.13],
 but not source compatible.

To work around not constructing these with the type parameter, some users will 
cast to the right type, like this:

{code:lang=java}
DatumReader<GenericRecord> reader = (DatumReader<GenericRecord>)
    GenericData.get().createDatumReader(schema);
{code}

That compiles in 1.8.1 because it is casting a raw {{DatumReader}} to 
{{DatumReader<GenericRecord>}}, but not with this change. After the change, it 
returns a {{DatumReader<Object>}} that Java won't convert. The fix is to remove 
the cast, and then Java correctly infers that the type parameter is 
{{GenericRecord}} instead of {{Object}}.
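
Under the new generic signature, the assignment target drives inference, so the 
fixed call site is just (a sketch; {{schema}} is assumed to describe a generic 
record):

{code:lang=java}
// No cast needed: the target type infers the type parameter.
DatumReader<GenericRecord> reader = GenericData.get().createDatumReader(schema);
{code}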

Do we guarantee source compatibility? Even if we do not, [~busbey], what do you 
think about including this incompatibility?



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-07-22 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389787#comment-15389787
 ] 

Doug Cutting commented on AVRO-1704:


+1 overall.

Two minor questions:
 - Is the change to the createDatumReader/Writer API fully back-compatible?
 - I think a few of the new public methods don't have javadoc.  It's probably 
worth building the docs and glancing through them to see how they look.  That 
usually inspires a lot of improvements and is especially useful with new APIs 
like this.

Other than that, LGTM.




[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-07-20 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387031#comment-15387031
 ] 

Ryan Blue commented on AVRO-1704:
-

I agree with your reasoning on naming, so let's go with MessageEncoder. I think 
that's reasonably distinct from the other classes.

By my builder comment, I meant that if we want to make it easier to instantiate 
a MessageDecoder we could add a builder rather than a factory method. That 
would make it easy to seed the decoder with compatible Schemas and select the 
GenericData subclass. Something like this:

{code:lang=java}
MessageDecoder<MyDatum> decoder = MessageDecoder.builder()
    .read(MyDatum.class)
    .schema(oldSchema1)
    .schema(oldSchema2)
    .build();
{code}

I don't think this is needed yet, since the constructors are fairly simple.

I think the implementation is about ready to commit, followed by the spec 
update for the 2-byte header used in this implementation. Is there anything 
else you think we should change?



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-07-18 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15382696#comment-15382696
 ] 

Doug Cutting commented on AVRO-1704:


I doubt we'll ever need this abstracted, and having it so might encourage a 
proliferation of message formats, but it might also prove useful someday, so I 
can live with that.  However I still don't see the need for multiple levels of 
abstraction (interface + abstract base class).  That still seems like vast 
overkill to me, but is probably not worth fighting about.

As far as terminology, "datum" is used to refer to the in-memory data structure 
(generic, specific, reflect, Thrift, protobuf) while "encode/decode" refer to 
specific serialized formats (binary, json).  A reader/writer translates between 
the in-memory structure and the abstract encoding API.  So where does "message" 
fit into this taxonomy?  I suppose it's a new serialized format, an extension 
of "binary", so "encode/decode" are probably more appropriate than "read/write".

Not sure what you mean about the builder.



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-07-17 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381561#comment-15381561
 ] 

Ryan Blue commented on AVRO-1704:
-

I think this should be abstract. The format that we're adding solves one set of 
uses, but the utility methods have value beyond that. Encoding a single Avro 
record is fairly common, but the implementations vary widely in quality because 
it is difficult to find the right setup of DatumWriter, BinaryEncoder, and 
ByteArrayOutputStream. Simplifying and improving applications that already do 
this is a good thing. And some of those uses, like the case I mentioned where 
we're embedding Avro in Parquet records, don't need the header or schema at all 
because that's defined in the file metadata.
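
For comparison, the setup applications hand-roll today looks roughly like this 
(a sketch assuming a generic record; no header, fingerprint, or schema is 
written):

{code:lang=java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

public class RawEncode {
  static byte[] encode(Schema schema, GenericRecord record) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
    writer.write(record, encoder);
    encoder.flush();  // the encoder buffers; flush before taking the bytes
    return out.toByteArray();
  }
}
{code}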

The abstraction is also useful for transitioning to the format we're defining 
here. The normal way to encode messages in Kafka is the 8-byte fingerprint 
followed by the encoded message payload. With the abstraction, you can write a 
decoder that checks for the header and then deserializes, or assumes the old 
format if the header is missing. That would enable rolling upgrades using the 
same Kafka topics, rather than needing a hard transition.

I would also include the abstraction in case we want to change or introduce a 
new format later.

bq. I also worry that names like BinaryDatumDecoder

I've pushed a new commit that moves the classes to org.apache.avro.message and 
renames them to MessageEncoder and MessageDecoder. I think used "encoder" 
instead of "reader" to contrast with the DatumReader and DatumWriter, since 
there is little difference between a datum and a message (a datum to encode by 
itself).

bq. Perhaps [the reusable i/o streams] should go in the util package so they 
can be used more widely?

I've moved them there. I avoided it before so that they weren't added to the 
public API, but I think it's fine to make them available.

bq. We might also add utilities for generic & reflect, like 
model#getMessageWriter(Schema)?

I looked at this, but then the GenericData classes would have both 
createDatumWriter and getMessageWriter, which looks confusing to me. Keeping 
the MessageEncoder above the level of the data models helps separate the 
DatumWriter from the MessageEncoder.

If we want to make instantiating these easier, then maybe a builder would be 
more appropriate. That would allow us to pass multiple writer schemas to the 
MessageDecoder.
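
A sketch of seeding multiple writer schemas through the SchemaStore abstraction 
as it eventually shipped in org.apache.avro.message; the helper and its 
parameters are hypothetical:

{code:lang=java}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.message.BinaryMessageDecoder;
import org.apache.avro.message.SchemaStore;

public class SeededDecoder {
  static BinaryMessageDecoder<GenericRecord> build(Schema readSchema,
                                                   Schema... writerSchemas) {
    // Cache of known writer schemas, looked up by fingerprint at decode time.
    SchemaStore.Cache schemaStore = new SchemaStore.Cache();
    for (Schema s : writerSchemas) {
      schemaStore.addSchema(s);
    }
    return new BinaryMessageDecoder<>(GenericData.get(), readSchema, schemaStore);
  }
}
{code}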



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-07-11 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371389#comment-15371389
 ] 

Doug Cutting commented on AVRO-1704:


I don't see why anyone would prefer the interface to the abstract base class.  
It seems like belt and suspenders (https://youtube.com/watch?v=VuWzeoIr7J4).   
Who do we imagine would implement this outside of the project?

Frankly, I question whether this even needs to be abstract.  Applications will use this 
API because they want to use Avro's tagged binary encoding for messages.  
Applications that want an untagged binary encoding can use the existing APIs.  
The in-memory format is already abstracted, and the encoding is fixed.  What 
we're providing here isn't an extensible framework, it's some utility code.  
Folks who seek to optimize away the 10-byte overhead can use a DatumWriter & 
BinaryEncoder as they do today.  That's an unsafe encoding and we needn't 
further simplify it.  Our goal is to provide an easy-to-use, safe, standard 
encoding for messages.

I also worry that names like BinaryDatumDecoder are confusing, when we already 
have BinaryDecoder and DatumReader.  We might instead call a so-prefixed binary 
encoded datum a "message", and have MessageWriter and MessageReader classes 
that implement this and a MessageSchemaStore, perhaps even placing these all in 
a new "message" package.

I won't reject this patch over these differences in style.  I prefer to not 
hide things behind abstractions until there's clear need.  At that point, when 
multiple implementations are required, one has a better idea of what the 
abstraction should be.  In the meantime, code is substantially smaller, easier 
to read, debug, maintain, etc.  But this is a style issue where reasonable 
folks might differ.

It's hard to believe we don't already have reusable array i/o streams around!  
Perhaps these should go in the util package so they can be used more widely?

I like the convenience methods generated for specific data.  We might also add 
utilities for generic & reflect, like model#getMessageWriter(Schema)?



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-07-09 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15369379#comment-15369379
 ] 

Ryan Blue commented on AVRO-1704:
-

Forgot to add: I've kept the new commits separate so you can see what changed. 
I'll squash them into the implementation when it is time to commit to master if 
this implementation is accepted.



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-07-09 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15369378#comment-15369378
 ] 

Ryan Blue commented on AVRO-1704:
-

[~cutting], I've pushed a couple new commits to the pull request. The changes 
include:
* Add ReusableByteBufferInputStream and ReusableByteArrayInputStream
* Make the encoder and decoder instances thread-safe
* Remove the thread-local encoder from Specific because the static encoder and 
decoder are now thread-safe
* Add tests using generic

That addresses the review feedback other than the question of whether to use an 
interface or an abstract class. I think the patch has the best of both options 
by including both an interface and an abstract base class 
(DatumDecoder.BaseDecoder) that implementations can use to cut down on 
boilerplate and maintain compatibility. That leaves the choice up to the 
implementer. If you have a strong opinion here, I can change it, but I think 
having both is a good solution (sketched below).
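
Concretely, the interface-plus-optional-base-class shape looks roughly like 
this (an abbreviated sketch; the pull request has more overloads):

{code:lang=java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

public interface DatumDecoder<D> {
  D decode(ByteBuffer buffer, D reuse) throws IOException;
  D decode(InputStream stream, D reuse) throws IOException;

  // convenience overloads that would otherwise be boilerplate in every impl
  D decode(ByteBuffer buffer) throws IOException;
  D decode(InputStream stream) throws IOException;

  // optional base class: extend it to get the boilerplate for free,
  // or implement the interface directly for full independence
  abstract class BaseDecoder<D> implements DatumDecoder<D> {
    @Override
    public D decode(ByteBuffer buffer) throws IOException {
      return decode(buffer, null);
    }

    @Override
    public D decode(InputStream stream) throws IOException {
      return decode(stream, null);
    }
  }
}
{code}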

Also, some of the tests are ignored because they don't pass without a 
modification to the ResolvingGrammarGenerator. Aliases don't appear to be 
working. I'm opening another issue with a patch for it.



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-06-28 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353906#comment-15353906
 ] 

Doug Cutting commented on AVRO-1704:


I think all the methods are useful but some of them (e.g., non-reuse) will 
always be implemented by boilerplate and are thus not core to the interface, 
but rather something more suitable for a base class.

An abstract base class would still permit independent alternative 
implementations.  The only additional power an interface has is that one can 
implement multiple interfaces.  But interfaces don't let you implement 
convenience methods, nor do they permit compatible evolution (if you ever add 
or remove a method, you break implementations, because you cannot provide 
default impls).  But if you feel multiple inheritance is important here, then 
it's probably easier to stick to an interface than, e.g., refactor into 
encoder/decoder provider classes that are separate from the user-invoked 
classes or some other way to avoid such boilerplate implementations.

Encoding to a ByteBuffer should be thread-safe, since it has no caller-visible 
state, no?



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-06-28 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353737#comment-15353737
 ] 

Ryan Blue commented on AVRO-1704:
-

I agree that the current interface is wide. I think we should have the datum 
reuse methods, which doubles the API. I think we definitely want the ByteBuffer 
methods. Do you think we don't need the InputStream methods? In the pull 
request there are also byte array methods, but it's easy for callers to use 
ByteBuffer instead.

I like having the interface so that alternative implementations can be 
independent. There's no guarantee that Avro's base class is useful to 
implementers and I don't see a need to force people to inherit from an Avro 
class when it may not make sense. There's an optional base class for 
convenience, so I think the benefits outweigh the cost.

+1 for getting rid of the performance pitfalls. I think we just need to find a 
reusable ByteArrayInputStream and make sure we can change the buffer list in 
ByteBufferInputStream. I'll look into it.
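
A reusable ByteArrayInputStream can be a thin subclass that swaps the 
underlying buffer between messages instead of allocating a stream per call; a 
minimal sketch (the method name is an assumption, not taken from the patch):

{code:lang=java}
import java.io.ByteArrayInputStream;

public class ReusableByteArrayInputStream extends ByteArrayInputStream {
  public ReusableByteArrayInputStream() {
    super(new byte[0]);
  }

  // re-point this stream at a new payload, resetting position and mark
  public void setByteArray(byte[] data, int offset, int length) {
    this.buf = data;
    this.pos = offset;
    this.mark = offset;
    this.count = Math.min(offset + length, data.length);
  }
}
{code}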

For thread safety we can just make the reused state thread-local like you 
suggest. Right now the Specific methods use a thread-local 
DatumEncoder/DatumDecoder. Do you think the DatumEncoder implementations should 
be thread-safe?

I think we do need the raw format. Right now there are a lot of systems already 
serializing Avro records in the equivalent of the raw format so I would like to 
have an Avro class that helps move to the new spec. Also, if the schema is 
fixed then there's no need for 10 extra bytes per payload so it is 
independently useful. For example, I use the raw format to store JSON payloads. 
The schema won't change and Avro is much smaller and faster.



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-06-28 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353614#comment-15353614
 ] 

Doug Cutting commented on AVRO-1704:


That decoder interface seems particularly wide.  Might these be better as base 
classes rather than interfaces?  What power does the interface add?

The initial implementations also have hidden performance pitfalls; some 
operations allocate streams & arrays for every call.  We might either go with a 
lean-and-mean API, or make sure that all of the supported invocations are 
efficient.  I'd prefer inefficiencies be manifest, forcing clients to allocate 
streams per call rather than folks assuming they're using a 
ByteBuffer-optimized API.

To optimize these in a thread-safe manner I think we'd add a 
ThreadLocal field, right?
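
For example, a thread-safe encode path might keep its per-call scratch state in 
a ThreadLocal (a sketch under that assumption; class and field names are 
illustrative, not from the patch):

{code:lang=java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

public class ThreadSafeDatumEncoder<D> {
  private final DatumWriter<D> writer;

  // each thread gets its own buffer, so concurrent encode() calls never collide
  private final ThreadLocal<ByteArrayOutputStream> buffer =
      new ThreadLocal<ByteArrayOutputStream>() {
        @Override protected ByteArrayOutputStream initialValue() {
          return new ByteArrayOutputStream();
        }
      };

  @SuppressWarnings("unchecked")
  public ThreadSafeDatumEncoder(GenericData model, Schema schema) {
    this.writer = (DatumWriter<D>) model.createDatumWriter(schema);
  }

  public ByteBuffer encode(D datum) throws IOException {
    ByteArrayOutputStream out = buffer.get();
    out.reset();
    // the BinaryEncoder could also be cached per thread; omitted for brevity
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    writer.write(datum, encoder);
    encoder.flush();
    return ByteBuffer.wrap(out.toByteArray());
  }
}
{code}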

Do we really need the raw format support?  This is supported by the existing 
API.  The primary goal here is to add support for a new, non-raw "message" 
format.

Without the interface & the raw format, this could become just two utility 
classes, MessageEncoder and MessageDecoder.  Is that too reductive?



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-06-27 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352355#comment-15352355
 ] 

Ryan Blue commented on AVRO-1704:
-

[~nielsbasjes], sorry it's taken so long for me to get back to you on this.

On the spec:
* I think we should go with the header 0xC3 0x01. The first byte makes it 
easily recognizable, as you suggest, and meets my requirement of minimizing the 
number of non-Avro payloads that match. Using 0x01 makes it easy to see the 
version and prevents programs from confusing payloads with text, as Doug 
suggests. (A sketch of the resulting layout follows this list.)
* I don't see much value in reserving space in the second byte. I don't think 
there will be many formats for serializing Avro payloads, and I don't think we 
will have problems with collision.
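
Putting those choices together, writing the header could look like this (a 
sketch; the class and constant names are mine):

{code:lang=java}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class MessageHeader {
  static final byte MARKER = (byte) 0xC3;
  static final byte VERSION = (byte) 0x01;

  // 2-byte marker/version plus the 8-byte little-endian CRC-64-AVRO
  // fingerprint; the encoded datum bytes would follow this header.
  static byte[] headerFor(Schema schema) {
    ByteBuffer header = ByteBuffer.allocate(10).order(ByteOrder.LITTLE_ENDIAN);
    header.put(MARKER);
    header.put(VERSION);
    header.putLong(SchemaNormalization.parsingFingerprint64(schema));
    return header.array();
  }
}
{code}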

I've had a look at your patch and there's a lot in there: an update to the 
spec, an implementation, an XOR demo, changes to Schema hashing, specific 
support, and static default classes. I think it would be helpful to get this in 
by breaking up the work into separate patches, pull requests, or issues.

I also think we should simplify the API a bit. I'd like to keep it small and 
grow it as we need, to keep maintenance and compatibility simple. For example, 
SchemaStorage has open and close methods that are only used in a test. I'd 
rather not add life-cycle methods like those unless the life-cycle of a 
SchemaStorage needs to be managed by Avro. To that end, I think we can simplify 
the API, and I propose the following:

{code:lang=java}
interface SchemaStore {
  Schema findByFingerprint(long fingerprint);
}
{code}
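
For illustration, a simple in-memory backend behind that one-method interface 
might look like this (hypothetical class name; a file-system store would follow 
the same shape):

{code:lang=java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class InMemorySchemaStore implements SchemaStore {
  private final Map<Long, Schema> byFingerprint =
      new ConcurrentHashMap<Long, Schema>();

  public void add(Schema schema) {
    byFingerprint.put(SchemaNormalization.parsingFingerprint64(schema), schema);
  }

  @Override
  public Schema findByFingerprint(long fingerprint) {
    return byFingerprint.get(fingerprint);
  }
}
{code}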

I also think that the message API should be focused around a datum and a buffer 
or stream. The data model (GenericData instance) and other things can be passed 
in to create it and then reused for efficiency. I've actually implemented this 
already for a project that stores Avro-encoded payloads in Parquet so I've 
[adapted that implementation|https://github.com/apache/avro/pull/103] to look 
up fingerprints from a SchemaStore. The API is broken into encoder and decoder 
sides to deal with separate concerns: for the encoder that's how to manage 
buffers and for the decoder it's how to resolve schemas and datum reuse.

{code:lang=java}
interface DatumEncoder<D> {
  // constructed with (GenericData model, Schema schema, boolean copyBuffer)
  ByteBuffer encode(D datum); // if copyBuffer was true, this is a new buffer
  void encode(D datum, OutputStream stream);
}

interface DatumDecoder<D> {
  // constructed with (GenericData model, Schema schema, SchemaStore store)
  D decode(ByteBuffer buffer);
  D decode(ByteBuffer buffer, D reuseDatum);
  D decode(InputStream stream);
  D decode(InputStream stream, D reuseDatum);
}
{code}

My branch is broken into a few commits. The first two are bug fixes, but the 
third is [the DatumEncoder implementation, 
d91b905|https://github.com/apache/avro/pull/103/commits/d91b90544f4486a72da8d3ff5b81dfc3c79d7c2f],
  and the fourth is [support for the Specific data model, 
7fa75aa|https://github.com/apache/avro/pull/103/commits/7fa75aab405c6460077d7cc7e403c664cce84431],
 based on your patch.

I'd like to hear what you think of the DatumEncoder API in that branch. It 
implements a few things that I think we'll need, like datum reuse, and it 
reuses encoders, DatumWriters, and buffers. It implements two encoder/decoder 
pairs: "raw", which is just the datum bytes, and "binary", which implements the 
header and schema lookup. It definitely needs some improvements, like more 
thorough tests and better naming, such as Doug's suggestion to use "message".


[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-06-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352338#comment-15352338
 ] 

ASF GitHub Bot commented on AVRO-1704:
--

GitHub user rdblue opened a pull request:

https://github.com/apache/avro/pull/103

AVRO-1704: Add DatumEncoder API



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rdblue/avro AVRO-1704-add-datum-encoder-decoder

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/avro/pull/103.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #103


commit 79a2993151ea7589c06b854ee7ac8e951816ecce
Author: Ryan Blue 
Date:   2016-06-28T03:37:56Z

AVRO-1869: Java: Fix Decimal conversion from ByteBuffer.

commit 3ca6a15ddf75e4c39468ddd1d454331f3f54f1e3
Author: Ryan Blue 
Date:   2016-06-28T03:40:14Z

AVRO-1704: Java: Add type parameter to createDatumReader and Writer.

commit d91b90544f4486a72da8d3ff5b81dfc3c79d7c2f
Author: Ryan Blue 
Date:   2016-06-28T03:41:40Z

AVRO-1704: Java: Add DatumEncoder and SchemaStore.

commit 7fa75aab405c6460077d7cc7e403c664cce84431
Author: Ryan Blue 
Date:   2016-06-28T03:44:06Z

AVRO-1704: Java: Add toByteArray and fromByteArray to specific.






[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-05-05 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273205#comment-15273205
 ] 

Niels Basjes commented on AVRO-1704:


Thanks for the great feedback. 
I'm going to work on these points.



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-05-04 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271360#comment-15271360
 ] 

Doug Cutting commented on AVRO-1704:


A few more comments:
- I think we can move all of the SpecificRecord#toBytes() and #fromBytes() code 
to SpecificRecordBase instead of generating it for each class.  I prefer to 
minimize generated code.  This might look like:{code}
public class SpecificRecordBase<T> {
  ...
  public T fromBytes(byte[] bytes) { return (T)...; }
}
public class Player extends SpecificRecordBase<Player> {
  ...
}
{code}
- I suspect using DataInputStream and DataOutputStream in public APIs may be 
problematic for performance long-term.  Maybe the only public API in the first 
version should be 'T fromMessage(byte[])' and 'byte[] toMessage(T)'?  This can 
then be optimized, and, if needed, a higher-performance lower-level API can be 
added.
- We should implement this API for more than just specific data.  This should 
work for generic data, Thrift, protobuf, etc., producing an identical format.  
So the base implementation should be passed a GenericData, which all of these 
inherit from, since it can create an appropriate DatumReader or DatumWriter.  
So this might look something like:{code}
package org.apache.avro.data;
public class MessageCoder<T> {
   private GenericData data;
   public MessageCoder(GenericData data, MessageSchemaRepo repo) { this.data = data; }
   public byte[] toMessage(T object) { ... }
   public T fromMessage(byte[] bytes) { ... }
}{code}
 - Permitting alternate schema repos and alternate in-memory object 
representations is important, but supporting alternate message formats is not.  
The goal here is to standardize a message format, so I would not design things 
for extensibility on that axis.
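
Hypothetical usage of the MessageCoder sketched above (assuming a 
MessageSchemaRepo instance named repo and a GenericRecord named record):

{code:lang=java}
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// one coder per data model + repo; the same shape works for generic,
// specific, Thrift, protobuf, etc., producing an identical format
MessageCoder<GenericRecord> coder =
    new MessageCoder<GenericRecord>(GenericData.get(), repo);
byte[] message = coder.toMessage(record);
GenericRecord roundTripped = coder.fromMessage(message);
{code}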



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-05-04 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271178#comment-15271178
 ] 

Doug Cutting commented on AVRO-1704:


A few quick comments:
- A prefix with non-printing characters has the benefit of making it clear this 
is binary data and should not be treated as text.  This may or may not matter 
here, but, for example, it is useful that there are non-printing characters at 
the start of a data file so that applications don't ever guess that these are 
text and subject to CRLF manipulation, etc.  Or, if instead, we want it to be 
printable, we should perhaps just use standard ASCII 'A' and '>'.  I don't see 
the advantage of using 'rare' printing characters, that just seems confusing to 
me.
- the changes to Schema#hashCode() may have performance implications, so we 
should at least run the Perf.java benchmarks before this is committed
- getFingerprint() needs javadoc
- invalidateHashes() is package-private, should be private
- SingleRecordSerializer is specific to SpecificRecord, so perhaps belongs in 
the specific package?
- Is this really for records only, or for any object?
- maybe the base class/interface should be called MessageEncoder instead of 
RecordSerializer, the package could be named 'message', and the storage could 
be called MessageSchemaRepo?
- the Xor example should be in a test package, not in the released library, no?



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-05-03 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15268928#comment-15268928
 ] 

Ryan Blue commented on AVRO-1704:
-

Yeah, sorry about not replying yet; I haven't had a good chance to review.

My current thought is that I'm fine with 2 bytes and 0xC3. It seems strange to 
me to pick an arbitrary byte for the version; maybe it would be better to go 
with 0x00. Also, I have some code that I've been using that I want to compare 
with what you have here, and I want to think about the API since it will be a 
popular one.



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-04-22 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254092#comment-15254092
 ] 

Niels Basjes commented on AVRO-1704:


Question: What would be the preferred way of handling error situations like 
* Unknown schema fingerprint
* Bad set of bytes (in various forms)

I see at least two general directions:
# Return null
# Throw an error

What is preferred in this case?
Which is 'better' for the application developers?



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-04-19 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247442#comment-15247442
 ] 

Niels Basjes commented on AVRO-1704:


I agree with what you are saying: the header should be shorter, but not too 
short. I think that having only 1 byte is too short; 2 bytes should be fine: 1 
marker byte, 1 body version byte.

So the updated proposal becomes:
* The header becomes 2 bytes in total: 'Ã' '<record structure version>'
** I chose the 'Ã' (0xC3) because
*** it is a 'human readable character',
*** it looks like an 'A' (from Avro) under a 'wave', and since the primary use 
case is currently streaming this seems like the right marker, and
*** it is a very uncommon character, so if we see it the collision probability 
drops dramatically.
** The '<record structure version>' can be any byte that essentially defines 
the record structure that follows. This can be used to indicate, for example, 
the difference between a normal record and an encrypted record.
*** I think that we should also pick an 'uncommon' byte for this one to mark 
the default record version. I think this one is a good candidate: '»' (0xBB), 
because it looks like a symbol for 'fast'.
* The default body (i.e. version 0xBB) becomes
** body: <fingerprint> <record>
*** fingerprint = CRC-64-AVRO(normalized schema) (8 bytes, little endian)
*** record = encoded Avro bytes using schema

So the overall record using the default body structure would look like this:
{code}
message = header body
 header = 'û' (== 0xC3 0xBB)
   body = <fingerprint> <record>
{code}

In the generated code I'll see what can be done to make both the header and 
body code 'pluggable'.
I think that the Schema Storage should get a capped 'cache' (LRU?) that retains 
the fingerprints that are 'known to not exist'.
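
A minimal sketch of such a capped negative cache (assuming an LRU keyed by 
fingerprint; the names are mine, not from the patch):

{code:lang=java}
import java.util.LinkedHashMap;
import java.util.Map;

public class MissingFingerprintCache {
  private static final Object PRESENT = new Object();
  private final Map<Long, Object> missing;

  public MissingFingerprintCache(final int capacity) {
    // access-ordered LinkedHashMap evicts the least recently used entry
    this.missing = new LinkedHashMap<Long, Object>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<Long, Object> eldest) {
        return size() > capacity;
      }
    };
  }

  public synchronized void markMissing(long fingerprint) {
    missing.put(fingerprint, PRESENT);
  }

  // check this before calling out to the schema registry again
  public synchronized boolean isKnownMissing(long fingerprint) {
    return missing.containsKey(fingerprint);
  }
}
{code}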





[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-04-17 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244991#comment-15244991
 ] 

Ryan Blue commented on AVRO-1704:
-

Sorry if what I said wasn't clear. I'm not proposing that we get rid of the 
header. I'm saying that we make it one byte instead of 4. I think what I 
outlined addresses the case where the schema cache miss is expensive and 
balances that with the per-message overhead. (I'm fine moving forward with the 
FP considered part of the body.)

A one-byte header results in lower than a 1/256 chance of an expensive lookup 
(by choosing carefully). Why is that too high? Why 4 bytes and not, for 
example, 2 for a 1/65536 chance?

I disagree that the impact of extra bytes is too small to matter. It (probably) 
won't cause fragmentation when sending one message, but we're not talking about 
just one message. Kafka's performance depends on batching records together for 
network operations and each message takes up space on disk. What matters is the 
percentage of data that is overhead: 4 bytes on a 500-byte message is 0.8%, 
and 4% on a 100-byte message.

In terms of how much older data I can keep in a Kafka topic, that accounts for 
11m 30s to 57m 30s per day (0.8% and 4% of 24 hours, respectively). If I 
provision for a 3-day window of data in Kafka, I'm losing between half an hour 
and 3 hours of that just to store 'Avr0' over and over. That's why I think we 
have to strike a balance between the two concerns. 1 or 2 bytes should really 
be sufficient, depending on the false-positive probability we want. And 
false-positives are only that costly if each one causes an RPC, which we can 
avoid with a little failure detection logic.



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-04-17 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244947#comment-15244947
 ] 

Niels Basjes commented on AVRO-1704:


A few of the thoughts I had when creating the current patch:
# Regarding the 'Avro' header (which I still believe to be 'the way to go')
#* The cost of going to the Schema registry is high on a 'cache miss'. Problems 
like I ran into with STORM-512 will occur in other systems too and may very 
well cause an overload on the schema registry.
#* I consider the cost of a fixed header of 4 bytes to be low. But that really 
depends on the size of the record being transmitted (my records are in the 
500-1000 bytes range).
#** These extra bytes will only be persisted in streaming systems like Kafka. 
Long term file formats (like AVRO, Parquet and ORC) won't store this.
#** In network traffic the overhead is 'unmeasurably small', because these 4 
bytes are unlikely to push the record over the size of a single TCP packet 
(1500 bytes).
# Regarding the schema fingerprint (which I consider a 'body' part).
#* The idea of the 'version' was that someone may want to use a different 
'hash' instead of the CRC-64-AVRO.
#* I think that in case of encryption we should have the fingerprint encrypted 
too.

*In light of the encryption option and your comments I'm now considering this 
_brainwave_*:
* The 'header of the message' should be pluggable.
** The default is a 'fixed shape' which includes a format id. (Same as what my 
current patch does).
** I expect that making this pluggable too is possible but that would have some 
restrictions like "all records of a schema must adhere to the same base format".
* The 'body of the message' should be pluggable too. 
** Format '0' is hardcoded (fingerprint+record). 
** Yet other versions (we should define a range like 0x80-0xFF) can be used by 
anyone to define a custom body definition (including encryption). I expect 
these versions to only exist within a specific company. If they need to 
exchange data with others they should share their format specification anyway.
* If we set the code up right we can have a layering system: I.e. someone can 
'insert' an encryption layer and still use the 'standard' body (after 
decryption).
** Such an 'encryption layer' would add additional parts, like an encryption 
type and a key id.




[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-04-16 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244347#comment-15244347
 ] 

Ryan Blue commented on AVRO-1704:
-

Looks like I was a little too optimistic about time to review things this week. 
Sorry for the delay. I think we're close to a spec. Here are some additional 
thoughts.

Looks like everyone is for using the CRC-64-AVRO fingerprint, which is good 
because it can be implemented in each language and doesn't require a library 
dependency. That's also what's often used in practice.

+1 for an interface in Avro that lets you plug in a schema resolver.

I think the fingerprint should be considered part of the header rather than the 
body. It's a small distinction, but the fingerprint is a proxy for the schema 
here and the body/payload depends on it. The schema is in the container file 
header, so this is consistent.

I want to avoid a 4-byte sentinel value in each message. There are two uses for 
it: to make sure the message is Avro and to communicate the format version 
should we want to change it later.

Because the schema fingerprint is included in the message, it is very unlikely 
that unknown payloads will be read as Avro messages, because that would require 
a collision with an 8-byte schema fingerprint. I think that's plenty of 
protection from passing along corrupt data. The concern that doesn't address is 
what happens when a fingerprint is unknown, which in a lot of cases will cause 
a REST call to resolve it. I don't think adding 4 bytes to every encoded 
payload is worth it to avoid this case, when the lookup can detect some number 
of failures and stop making the RPC calls. I just don't think we should design 
the format around a solvable problem like that.

I think the second use, versioning the format, is a good idea. That only 
requires one byte and including that byte can also serve as a way to detect 
non-Avro payloads, just with a higher probability of collision. I think that's 
a reasonable compromise. There would be something like a 1/256 chance that the first 
byte collides, assuming that byte is random in the non-Avro payload. That 
dramatically reduces the problem of making RPC calls to resolve unknown schema 
FPs. We want to choose the version byte carefully because other formats could 
easily have 0x00, 0x01, or an ASCII character there. I propose the version 
number with the MSB set, 0x80. That's unlikely to conflict with a flags byte, 
the first byte of a number, or the first character of a string.

That makes the format:
{code}
message = header body
 header = 0x80 CRC-64-AVRO(schema) (8 bytes, little endian)
   body = encoded Avro bytes using schema
{code}

We could additionally have a format with a 4-byte FP, version 0x81, if anyone 
is interested in it. Something simple like XOR the first 4 bytes with the 
second 4 bytes of the CRC-64-AVRO fingerprint. 8 bytes just seems like a lot 
when this gets scaled up to billions of records.
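
That folding could be as simple as this (purely illustrative; the class and 
method names are mine):

{code:lang=java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class ShortFingerprint {
  // XOR the high and low halves of the 64-bit CRC-64-AVRO fingerprint
  static int fingerprint32(Schema schema) {
    long fp64 = SchemaNormalization.parsingFingerprint64(schema);
    return (int) ((fp64 >>> 32) ^ fp64);
  }
}
{code}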

One last thought: in the implementation, it would be nice to allow skipping the 
version byte because a lot of people have already implemented this as 
CRC-64-AVRO + encoded bytes. That would make the Avro implementation compatible 
with existing data flows and increase the chances that we can move to this 
standard format.


[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-04-13 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238996#comment-15238996
 ] 

Niels Basjes commented on AVRO-1704:


I have a first addition: think about supporting encryption.



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-04-11 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15235438#comment-15235438
 ] 

Ryan Blue commented on AVRO-1704:
-

Thanks for working on this, Niels. I'll make some comments later today or 
tomorrow.



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-03-24 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210535#comment-15210535
 ] 

Niels Basjes commented on AVRO-1704:


I did some experimenting over the last week and posted my modified version of 
Avro here: https://github.com/nielsbasjes/avro/tree/AVRO-1704

What I did so far:
# Added to Schema a getFingerPrint() method that uses CRC-64-AVRO to calculate 
the schema fingerprint.
# Added a few SchemaStorage-related classes that allow storing schemas in 
memory.
# Added to the generated classes a toBytes() method and a static fromBytes() 
method. Both delegate to the 'real' implementations in the SpecificRecordBase 
class.

All of this passes the Java unit tests.

On the application side, my test code (using 3 slightly different variations of 
the same schema) looks like this, and it works exactly as I expect:
{code:java}
// Register the three schema variants so they can be found by fingerprint.
SchemaFactory.put(com.bol.measure.v1.Measurement.getClassSchema());
SchemaFactory.put(com.bol.measure.v2.Measurement.getClassSchema());
SchemaFactory.put(com.bol.measure.v3.Measurement.getClassSchema());

// Serialize with the v1 schema ...
com.bol.measure.v1.Measurement measurement =
    DummyMeasurementFactory.createTestMeasurement(timestamp);
byte[] bytesV1 = measurement.toBytes();

// ... and read the same bytes back as v2 and v3; the embedded
// fingerprint resolves the writer's schema for evolution.
com.bol.measure.v2.Measurement newBornV2 =
    com.bol.measure.v2.Measurement.fromBytes(bytesV1);
com.bol.measure.v3.Measurement newBornV3 =
    com.bol.measure.v3.Measurement.fromBytes(bytesV1);
{code}

Things currently missing: documentation, extra tests, etc.

I could really use some feedback on the structure of my change, and advice on 
how to approach the need to call a 'close()' method on the schema storage part.

Thanks.



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-03-11 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15190866#comment-15190866
 ] 

Niels Basjes commented on AVRO-1704:


Thanks for pointing this out. 

My updated proposal for this:
{code}"Avro" <version> <Fingerprint> <Record>{code}
Where
# "version" = 1 byte indicating the version (or "schema") of the rest of the 
bytes.
If version == 0x00:
# "Fingerprint" = the CRC-64-AVRO of the canonical form of the schema.
# "Record" = the record serialized to bytes using the existing serialization 
system.

I personally do not like these 'chopped' prefixes if there is no really good 
reason to chop them (like saving length). Because the project's name is so 
short, in this proposal I'm sticking to the full name of the project as the 
prefix: "Avro" (i.e. these 4 bytes: 0x41, 0x76, 0x72, 0x6F).
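
A sketch of writing that framing; the field order after the magic follows my 
reading of the list above, so treat it as illustrative:
{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class AvroPrefixFramer {
  public static byte[] frame(long fingerprint, byte[] recordBytes) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write("Avro".getBytes(StandardCharsets.US_ASCII)); // magic: 0x41 0x76 0x72 0x6F
    out.write(0x00);                                       // version byte
    out.write(ByteBuffer.allocate(8).putLong(fingerprint).array()); // CRC-64-AVRO fingerprint
    out.write(recordBytes); // the record in the existing binary encoding
    return out.toByteArray();
  }
}
{code}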




[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-03-10 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189594#comment-15189594
 ] 

Doug Cutting commented on AVRO-1704:


bq. remove the things that do not impact the binary form of the record

This is already done as part of fingerprint calculation.

https://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas
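
For reference, the Java API already exposes this; a short sketch 
({{schemaJson}} is a placeholder for any schema string):
{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

// Parsing Canonical Form drops attributes (doc, aliases, ...) that do not
// affect the binary encoding, so a doc-only edit keeps the same fingerprint.
Schema schema = new Schema.Parser().parse(schemaJson);
String canonical = SchemaNormalization.toParsingForm(schema);
long fp = SchemaNormalization.parsingFingerprint64(schema); // CRC-64-AVRO of the canonical form
{code}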

Also, if we opt for a prefix, we might use something like 'A'+'v'+'r'+0, where 
the last character also indicates the format version, including the schema hash 
function. That's similar to what's used to label the file format, and it has 
the side benefit of clearly demonstrating that this is binary, non-textual 
data.



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-03-10 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189473#comment-15189473
 ] 

Niels Basjes commented on AVRO-1704:


Note that having the "AVRO" prefix will also limit the number of needless calls 
to the schema registry when bad records are put into the stream (like the timer 
ticks example).



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-03-10 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189402#comment-15189402
 ] 

Niels Basjes commented on AVRO-1704:


I've been looking into what kind of solution would work here, since I'm working 
on a project where we need data structures going into Kafka to be available to 
multiple consumers.

The fundamental problem we need to solve is that of "Schema Evolution" in a 
streaming environment (let's assume Kafka with its built-in persistence of 
records).
We need three things to make this happen:
# A way to recognize that a 'blob' is a serialized AVRO record.
#* We could simply assume it is always an AVRO record.
#* I think we should instead let such a record start with "AVRO" to ensure we 
can cleanly catch problems like STORM-512 (summary: timer ticks were written 
into Kafka, which caused a lot of deserialization errors when reading the AVRO 
records).
# A way to determine the schema this was written with.
#* As indicated above, I vote for using the CRC-64-AVRO.
#** I noticed that a simple typo fix in the documentation of a schema causes a 
new fingerprint to be generated.
#** Proposal: I think we should 'clean' the schema before calculating the 
fingerprint, i.e. remove the things that do not impact the binary form of the 
record (like the doc field).
# A place where we can find the schemas using the fingerprint as the key.
#* Here I think (looking at AVRO-1124 and the fact that there are ready-to-run 
implementations like this [Schema 
Registry|http://docs.confluent.io/current/schema-registry/docs/index.html]) we 
should limit what we keep inside Avro to something like a "SchemaFactory" 
interface (as the storage/retrieval interface to get a Schema) and a very basic 
implementation that simply reads the available schemas from a (set of) 
property file(s). Using this, others can write additional implementations that 
can read/write to things like databases or the above-mentioned Schema Registry 
(a rough sketch follows this list).
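
A rough sketch of such a {{SchemaFactory}} interface; the names are 
illustrative, not an existing API:
{code:java}
import org.apache.avro.Schema;

public interface SchemaFactory {
  /** Register a schema, keyed by its CRC-64-AVRO fingerprint. */
  void put(Schema schema);

  /** Return the writer's schema for a fingerprint, or null if unknown. */
  Schema get(long fingerprint);
}
{code}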

So to summarize, my proposal for the standard {{single record serialization 
format}} can be written as:
{code}"AVRO" <fingerprint> <record>{code}

[~rdblue], I'm seeking feedback from you guys on this proposal. 




[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2016-02-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15133238#comment-15133238
 ] 

ASF GitHub Bot commented on AVRO-1704:
--

Github user asfgit closed the pull request at:

https://github.com/apache/avro/pull/43




[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2015-09-24 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14906640#comment-14906640
 ] 

Ryan Blue commented on AVRO-1704:
-

[~dasch], I think the most common one is CRC-64-AVRO. That's exactly why we 
need to standardize this, though. I think we should go with just one, and it 
would be good to have confirmation from the Kafka and Flume communities on 
which one they currently use.



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2015-09-21 Thread Daniel Schierbeck (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900357#comment-14900357
 ] 

Daniel Schierbeck commented on AVRO-1704:
-

[~rdblue] If there's already widespread usage of the bare fingerprint + encoded 
bytes format, then I can simply implement that.



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2015-09-11 Thread Daniel Schierbeck (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740476#comment-14740476
 ] 

Daniel Schierbeck commented on AVRO-1704:
-

I think it's fine to standardize on a single fingerprint type. As for the 
metadata map, I was thinking that it would be nice for generic tools to use, 
e.g. keeping track of Kafka offsets and partitions when moving encoded data 
around. It's not a requirement, though, so if it's easier to get traction 
without it I wouldn't mind.



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2015-09-11 Thread Daniel Schierbeck (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740477#comment-14740477
 ] 

Daniel Schierbeck commented on AVRO-1704:
-

If we can agree on a format I can do the Ruby implementation.



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2015-09-10 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14739165#comment-14739165
 ] 

Ryan Blue commented on AVRO-1704:
-

I think this is a good idea. Quite a few people are doing this already, but 
with ad-hoc formats. [~granthenke] and [~gwenshap] are probably interested in 
this topic as well.

I think the one that is most widely used is simply the 8-byte schema 
fingerprint from Java (SHA256?) followed by the encoded bytes; a sketch of 
splitting that framing appears below. For compatibility with existing data in 
Kafka, I'd recommend going with that unless we have good reason to change it. I 
think it's better to specify the fingerprint ahead of time so we don't waste 
space encoding which one (or require more complicated code).

That leaves the format version number and metadata map, keeping in mind that if 
we decide we need either one then we are breaking compatibility with existing 
data and tools -- that's not too bad, but we should be aware of it. I like the 
idea of a format version number, but it might be unnecessary. I'm interested to 
hear what you envision the key/value metadata would be used for, too.
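
For concreteness, a sketch of splitting that ad-hoc framing; the byte order and 
the absence of any prefix are assumptions, and {{payload}} is a placeholder for 
the received bytes:
{code:java}
import java.nio.ByteBuffer;

ByteBuffer message = ByteBuffer.wrap(payload);
long fingerprint = message.getLong();        // first 8 bytes: schema fingerprint
byte[] body = new byte[message.remaining()]; // remainder: Avro-encoded datum
message.get(body);
{code}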



[jira] [Commented] (AVRO-1704) Standardized format for encoding messages with Avro

2015-07-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629488#comment-14629488
 ] 

ASF GitHub Bot commented on AVRO-1704:
--

GitHub user dasch opened a pull request:

https://github.com/apache/avro/pull/43

AVRO-1704: Standardized format for encoding messages with Avro

This is a proof of concept implementation of 
[AVRO-1704](https://issues.apache.org/jira/browse/AVRO-1704).

- The fingerprint implementation is mocked out.
- Only 64-bit fingerprints are supported.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dasch/avro dasch/message-format

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/avro/pull/43.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #43


commit 5765e59879e2c70ec2095dd666105d26e0d592fc
Author: Daniel Schierbeck da...@zendesk.com
Date:   2015-07-16T09:05:38Z

Add the Avro::Message format

commit f1286548ebf0e2b8ef50d604251fcfbd70137b8b
Author: Daniel Schierbeck da...@zendesk.com
Date:   2015-07-16T09:28:03Z

Add SchemaStore

Currently it's using a mock fingerprint implementation and only stores
64-bit fingerprints.



