[jira] [Commented] (KAFKA-3744) Message format needs to identify serializer

2016-05-27 Thread Ewen Cheslack-Postava (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303666#comment-15303666
 ] 

Ewen Cheslack-Postava commented on KAFKA-3744:
--

Just to second [~ijuma]'s comments, this absolutely needs a KIP. "Affects the 
format" doesn't quite capture the requirements for a KIP. Even things that 
affect semantics but don't strictly affect format are subject to KIPs. The end 
result of the KIP could be that it doesn't affect older clients that simply 
ignore those bits, but it's still really important to have that discussion and 
make sure that's an acceptable path.

Re: the specific proposal, I'm skeptical. Magic bytes are a *very* common 
approach for format detection: they don't require any specialized support, are 
used by a lot of people today, and seem to work fine in practice. From my 
reading, the proposal also assumes that key and value serialization are the 
same, which turns out not to be the case for many users (I have seen this a 
lot in practice, based on issues filed against Confluent's REST proxy where 
people want simple serialization for keys, e.g. UTF8 strings, and complex 
serialization for values, e.g. GenericRecords). Formats like JSON are the main 
exception here re: magic bytes. My impression is that folks who actually think 
about multiple formats realize up front that they need magic bytes and include 
them. If you use something like JSON, you tend to track this externally, so 
that you know from the topic which format you're using. I'm not convinced 
of the benefit here.
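
To illustrate the magic-byte approach mentioned above, here is a minimal sketch, assuming a one-byte prefix on each serialized key or value. The byte values and the names `MAGIC_TEXT`, `MAGIC_JSON`, `encode` and `decode` are hypothetical, made up for this example; they are not part of Kafka or of any serializer library.

```python
import json

# Hypothetical magic bytes; the values are assumptions chosen for this sketch.
MAGIC_TEXT = 0x01
MAGIC_JSON = 0x02


def encode(magic: int, body: bytes) -> bytes:
    """Prefix the serialized body with a one-byte format marker."""
    return bytes([magic]) + body


def decode(payload: bytes):
    """Pick a deserializer based on the first byte of the payload."""
    if not payload:
        raise ValueError("empty payload")
    magic, body = payload[0], payload[1:]
    if magic == MAGIC_TEXT:
        return body.decode("utf-8")
    if magic == MAGIC_JSON:
        return json.loads(body.decode("utf-8"))
    raise ValueError(f"unknown magic byte: {magic:#04x}")
```

Because the marker travels inside the opaque bytes, a key and a value can each carry their own marker independently, and the broker does not need to know about it.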
 

> Message format needs to identify serializer
> ---
>
> Key: KAFKA-3744
> URL: https://issues.apache.org/jira/browse/KAFKA-3744
> Project: Kafka
> Issue Type: Improvement
> Reporter: David Kay
> Priority: Minor
>
> https://issues.apache.org/jira/browse/KAFKA-3698 was recently resolved with 
> https://github.com/apache/kafka/commit/27a19b964af35390d78e1b3b50bc03d23327f4d0.
> But Kafka documentation on message formats needs to be more explicit for new 
> users. Section 1.3 Step 4 says: "Send some messages" and takes lines of text 
> from the command line. The beginner's guide 
> (http://www.slideshare.net/miguno/apache-kafka-08-basic-training-verisign), 
> slide 104, says:
> {noformat}
> Kafka does not care about data format of msg payload
> Up to developer to handle serialization/deserialization
>   Common choices: Avro, JSON
> {noformat}
> If one producer sends lines of console text, another producer sends Avro, a 
> third producer sends JSON, and a fourth sends CBOR, how does the consumer 
> identify which deserializer to use for the payload?  The commit includes an 
> opaque K byte Key that could potentially include a codec identifier, but 
> provides no guidance on how to use it:
> {quote}
> "Leaving the key and value opaque is the right decision: there is a great 
> deal of progress being made on serialization libraries right now, and any 
> particular choice is unlikely to be right for all uses. Needless to say a 
> particular application using Kafka would likely mandate a particular 
> serialization type as part of its usage."
> {quote}
> Mandating any particular serialization is as unrealistic as mandating a 
> single mime-type for all web content.  There must be a way to signal the 
> serialization used to produce this message's V byte payload, and documenting 
> the existence of even a rudimentary codec registry with a few values (text, 
> Avro, JSON, CBOR) would establish the pattern to be used for future 
> serialization libraries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-3744) Message format needs to identify serializer

2016-05-25 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300130#comment-15300130
 ] 

Ismael Juma commented on KAFKA-3744:


Also, I took a look at the PR and it's not clear to me why `avro-binary` is 
given preferential treatment:

{code}
0 and 1 specify two payload encodings (text and avro-binary); key format is 
unspecified.
 +  2 specifies that the key must be a JSON object with a property "t" c
{code}



[jira] [Commented] (KAFKA-3744) Message format needs to identify serializer

2016-05-25 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300119#comment-15300119
 ] 

Ismael Juma commented on KAFKA-3744:


It changes the message format so it needs a KIP. :) The KIP page even says: "We 
need to spend significantly more time on log format and protocol". Once those 
two bits are used for the purpose you propose, they cannot be used for anything 
else, so we take such changes very seriously (we don't have many free bits left 
as you can see).

As I said, it may be worth asking for feedback on the mailing list before 
writing a complete KIP, so that you can gauge the response before spending the 
time on it.



[jira] [Commented] (KAFKA-3744) Message format needs to identify serializer

2016-05-25 Thread David Kay (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300075#comment-15300075
 ] 

David Kay commented on KAFKA-3744:
--

Hi Ismael, thanks for the reply.

I can submit a KIP, but I don't believe this proposal meets the stated 
requirements for a KIP.  It does not change either the message format or the 
on-disk format in any manner that would affect current software.  The currently 
documented structure includes a one-byte "attributes" field that defines bits 
0-3 and reserves bits 4-7 for future use.  This proposal assigns meaning to 
bits 4-5 which were previously undefined, and leaves bits 6-7 reserved for 
future use.
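
The bit assignment described above can be sketched as follows. This is only an illustration under stated assumptions: it takes bit 0 to be the least significant bit of the attributes byte, and the names `FORMAT_SHIFT`, `FORMAT_MASK`, `set_format` and `get_format` are hypothetical, not taken from the Kafka code base or the pull request.

```python
# Bits 4-5 of the one-byte attributes field carry the proposed format code.
# Assumption: bit 0 is the least significant bit.
FORMAT_SHIFT = 4
FORMAT_MASK = 0b11 << FORMAT_SHIFT  # 0x30


def set_format(attributes: int, fmt: int) -> int:
    """Return the attributes byte with bits 4-5 set to fmt (0..3)."""
    if not 0 <= fmt <= 3:
        raise ValueError("format code must fit in two bits")
    # Clear bits 4-5, keep every other bit, then write the new code.
    return (attributes & ~FORMAT_MASK & 0xFF) | (fmt << FORMAT_SHIFT)


def get_format(attributes: int) -> int:
    """Extract the two-bit format code from bits 4-5."""
    return (attributes & FORMAT_MASK) >> FORMAT_SHIFT
```

Bits 0-3 and 6-7 pass through unchanged, so a reader that masks off only the bits it understands sees the same values as before.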

All current producer, consumer, and messaging software would continue to run 
unchanged if this proposal were adopted.  Future producers and consumers could 
optionally use the two attribute bits, but the messaging software is unaffected 
by whether those bits remain undefined or are used for something.

Let me know if you think a KIP would still be helpful.



[jira] [Commented] (KAFKA-3744) Message format needs to identify serializer

2016-05-24 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297898#comment-15297898
 ] 

Ismael Juma commented on KAFKA-3744:


Hi [~davek22]. A change to the message format would require a KIP:

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

You may also choose to email the mailing list before doing the KIP to get 
feedback from a wider group. There are other ways of achieving something like 
this (e.g. https://github.com/confluentinc/schema-registry) with different 
trade-offs.



[jira] [Commented] (KAFKA-3744) Message format needs to identify serializer

2016-05-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297473#comment-15297473
 ] 

ASF GitHub Bot commented on KAFKA-3744:
---

GitHub user davek2 opened a pull request:

https://github.com/apache/kafka/pull/1419

Allocate 2 attribute bits to signal payload format

This documentation update proposes a mechanism to signal the serialization 
used for the message payload, resolving issue 
https://issues.apache.org/jira/browse/KAFKA-3744.  No change is made to the 
message structure; two previously-reserved bits in the attribute byte now have 
defined values, and for one of four cases the key field is defined to be a JSON 
object.

No change is required to messaging software.   No change is required to 
existing producer and consumer clients that use pre-agreed payload 
serialization. 

Misc notes:
1) Only one attribute bit would be needed if serialization were always 
signalled using the key field.  But it seems preferable to define two common 
serializations that do not have any dependency on the key field.  Selection of 
the common formats is arbitrary; text and avro seem reasonable but any two 
could be used instead.
2) The compression attribute uses three bits but only two are defined.  If 
the intent is to use all three bits for compression the undefined values should 
be listed as reserved; if not, the timestamp attribute can slide down to bit 2 
and serialization to bits 3~4, leaving bits 5~7 reserved.
3) It's unclear why message field 6 should be called "key" - a 
variable-length field is more likely to be described as "attributes" or 
"metadata", and 1-byte field 3 would be called "flags" instead of "attributes".
4) Field 8 is called "payload" under message format and "value" under 
on-disk format.  Changed to payload in both places.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davek2/kafka trunk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/1419.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1419


commit 1d88b8d48cdfe67989bebf239f7588ca24e961b6
Author: Joe 
Date:   2016-05-24T00:32:04Z

Allocate 2 attribute bits for payload format



