[jira] [Commented] (KAFKA-3744) Message format needs to identify serializer
[ https://issues.apache.org/jira/browse/KAFKA-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303666#comment-15303666 ]

Ewen Cheslack-Postava commented on KAFKA-3744:
--

Just to second [~ijuma]'s comments, this absolutely needs a KIP. "Affects the format" doesn't quite capture the requirements for a KIP. Even things that affect semantics but don't strictly affect format are subject to KIPs. The end result of the KIP could be that it doesn't affect older clients that simply ignore those bits, but it's still really important to have that discussion and make sure that's an acceptable path.

Re: the specific proposal, I'm skeptical. Magic bytes are a *very* common approach for format detection and don't require any specialized support, are used by a lot of people today, and seem to work fine in practice. From my reading, the proposal also assumes that key and value serialization is the same, which it turns out is not the case for many users (and I have found this in practice a lot based on issues filed against Confluent's REST proxy where people want simple serialization for keys, e.g. UTF8 strings, and complex serialization for values, e.g. GenericRecords).

Formats like JSON are the main exception here re: magic bytes. My impression is that folks that actually think about multiple formats realize up front that you need magic bytes and include them. If you use something like JSON, you tend to track this somehow externally such that you know based on topics what format you're using. I'm not convinced of the benefit here.

> Message format needs to identify serializer
> ---
>
> Key: KAFKA-3744
> URL: https://issues.apache.org/jira/browse/KAFKA-3744
> Project: Kafka
> Issue Type: Improvement
> Reporter: David Kay
> Priority: Minor
>
> https://issues.apache.org/jira/browse/KAFKA-3698 was recently resolved with
> https://github.com/apache/kafka/commit/27a19b964af35390d78e1b3b50bc03d23327f4d0.
> But Kafka documentation on message formats needs to be more explicit for new
> users. Section 1.3 Step 4 says: "Send some messages" and takes lines of text
> from the command line. Beginner's guide
> (http://www.slideshare.net/miguno/apache-kafka-08-basic-training-verisign
> Slide 104 says:
> {noformat}
> Kafka does not care about data format of msg payload
> Up to developer to handle serialization/deserialization
> Common choices: Avro, JSON
> {noformat}
> If one producer sends lines of console text, another producer sends Avro, a
> third producer sends JSON, and a fourth sends CBOR, how does the consumer
> identify which deserializer to use for the payload? The commit includes an
> opaque K byte Key that could potentially include a codec identifier, but
> provides no guidance on how to use it:
> {quote}
> "Leaving the key and value opaque is the right decision: there is a great
> deal of progress being made on serialization libraries right now, and any
> particular choice is unlikely to be right for all uses. Needless to say a
> particular application using Kafka would likely mandate a particular
> serialization type as part of its usage."
> {quote}
> Mandating any particular serialization is as unrealistic as mandating a
> single mime-type for all web content. There must be a way to signal the
> serialization used to produce this message's V byte payload, and documenting
> the existence of even a rudimentary codec registry with a few values (text,
> Avro, JSON, CBOR) would establish the pattern to be used for future
> serialization libraries.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
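
For readers unfamiliar with the magic-byte approach mentioned in the comment above, here is a minimal sketch of how a producer and consumer might agree on a format without any broker support. The tag values `MAGIC_TEXT` and `MAGIC_JSON` and the helper names `encode` and `decode` are hypothetical and chosen only for illustration; they are not a Kafka or Confluent convention.

```python
import json

# Hypothetical one-byte tags, illustrative only (not a Kafka or Confluent convention).
MAGIC_TEXT = 0x01
MAGIC_JSON = 0x02


def encode(magic: int, body: bytes) -> bytes:
    """Prefix an already-serialized body with a single magic byte."""
    return bytes([magic]) + body


def decode(payload: bytes):
    """Choose a deserializer by inspecting the first byte of the payload."""
    if not payload:
        raise ValueError("empty payload has no magic byte")
    magic, body = payload[0], payload[1:]
    if magic == MAGIC_TEXT:
        return body.decode("utf-8")
    if magic == MAGIC_JSON:
        return json.loads(body.decode("utf-8"))
    raise ValueError(f"unknown magic byte: {magic:#04x}")
```

Because the tag travels inside the opaque bytes, the same scheme can be applied separately to the key and to the value, which fits the case where the two use different serializations.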
[jira] [Commented] (KAFKA-3744) Message format needs to identify serializer
[ https://issues.apache.org/jira/browse/KAFKA-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300130#comment-15300130 ]

Ismael Juma commented on KAFKA-3744:

Also, I took a look at the PR and it's not clear to me why `avro-binary` is given preferential treatment:

{code}
0 and 1 specify two payload encodings (text and avro-binary); key format is unspecified.
+ 2 specifies that the key must be a JSON object with a property "t" c
{code}
[jira] [Commented] (KAFKA-3744) Message format needs to identify serializer
[ https://issues.apache.org/jira/browse/KAFKA-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300119#comment-15300119 ]

Ismael Juma commented on KAFKA-3744:

It changes the message format so it needs a KIP. :) The KIP page even says: "We need to spend significantly more time on log format and protocol". Once those two bits are used for the purpose you propose, they cannot be used for anything else, so we take such changes very seriously (we don't have many free bits left as you can see).

As I said, it may be worth just asking for feedback on the mailing list before writing a complete KIP if you'd like to get some feedback before spending the time on it.
[jira] [Commented] (KAFKA-3744) Message format needs to identify serializer
[ https://issues.apache.org/jira/browse/KAFKA-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300075#comment-15300075 ]

David Kay commented on KAFKA-3744:
--

Hi Ismael, thanks for the reply. I can submit a KIP, but I don't believe this proposal meets the stated requirements for a KIP. It does not change either the message format or the on-disk format in any manner that would affect current software.

The currently documented structure includes a one-byte "attributes" field that defines bits 0-3 and reserves bits 4-7 for future use. This proposal assigns meaning to bits 4-5 which were previously undefined, and leaves bits 6-7 reserved for future use. All current producer, consumer, and messaging software would continue to run unchanged if this proposal were adopted. Future producers and consumers could optionally use the two attribute bits, but the messaging software is unaffected by whether those bits remain undefined or are used for something.

Let me know if you think a KIP would still be helpful.
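
The bit layout described in David's comment (bits 0-3 already defined, bits 4-5 proposed for the payload format, bits 6-7 still reserved) can be sketched with ordinary masking. This is only an illustration of the arithmetic, not code from the pull request; the names `FORMAT_SHIFT`, `FORMAT_MASK`, `set_payload_format`, and `get_payload_format` are hypothetical.

```python
# Hypothetical helpers illustrating the proposed use of attribute bits 4-5.
FORMAT_SHIFT = 4
FORMAT_MASK = 0b11 << FORMAT_SHIFT  # selects bits 4-5 of the attributes byte


def set_payload_format(attributes: int, fmt: int) -> int:
    """Return the attributes byte with bits 4-5 set to fmt, other bits untouched."""
    if not 0 <= fmt <= 3:
        raise ValueError("a two-bit format code must be in the range 0..3")
    return (attributes & ~FORMAT_MASK & 0xFF) | (fmt << FORMAT_SHIFT)


def get_payload_format(attributes: int) -> int:
    """Extract the two-bit format code from bits 4-5."""
    return (attributes & FORMAT_MASK) >> FORMAT_SHIFT
```

A message written before such a change would have zeros in the reserved bits, so a new consumer would read format code 0 from it, while an older consumer that masks only bits 0-3 would see no difference.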
[jira] [Commented] (KAFKA-3744) Message format needs to identify serializer
[ https://issues.apache.org/jira/browse/KAFKA-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297898#comment-15297898 ]

Ismael Juma commented on KAFKA-3744:

Hi [~davek22]. A change to the message format would require a KIP:

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

You may also choose to email the mailing list before doing the KIP to get feedback from a wider group. There are other ways of achieving something like this (eg https://github.com/confluentinc/schema-registry) with different trade-offs.
[jira] [Commented] (KAFKA-3744) Message format needs to identify serializer
[ https://issues.apache.org/jira/browse/KAFKA-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297473#comment-15297473 ]

ASF GitHub Bot commented on KAFKA-3744:
---

GitHub user davek2 opened a pull request:

    https://github.com/apache/kafka/pull/1419

    Allocate 2 attribute bits to signal payload format

    This documentation update proposes a mechanism to signal the serialization used for the message payload, resolving issue https://issues.apache.org/jira/browse/KAFKA-3744.

    No change is made to the message structure; two previously-reserved bits in the attribute byte now have defined values, and for one of four cases the key field is defined to be a JSON object. No change is required to messaging software. No change is required to existing producer and consumer clients that use pre-agreed payload serialization.

    Misc notes:

    1) Only one attribute bit would be needed if serialization were always signalled using the key field. But it seems preferable to define two common serializations that do not have any dependency on the key field. Selection of the common formats is arbitrary; text and avro seem reasonable but any two could be used instead.

    2) The compression attribute uses three bits but only two are defined. If the intent is to use all three bits for compression the undefined values should be listed as reserved; if not, the timestamp attribute can slide down to bit 2 and serialization to bits 3~4, leaving bits 5~7 reserved.

    3) It's unclear why message field 6 should be called "key" - a variable-length field is more likely to be described as "attributes" or "metadata", and 1-byte field 3 would be called "flags" instead of "attributes".

    4) Field 8 is called "payload" under message format and "value" under on-disk format. Changed to payload in both places.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davek2/kafka trunk

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/1419.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1419

commit 1d88b8d48cdfe67989bebf239f7588ca24e961b6
Author: Joe
Date: 2016-05-24T00:32:04Z

    Allocate 2 attribute bits for payload format
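
To make the pull request's scheme concrete, here is a rough consumer-side sketch based only on what this thread states: codes 0 and 1 name text and avro-binary, and code 2 means the key is a JSON object with a property "t". The function name `describe_payload` is hypothetical. What "t" contains is not spelled out in the thread, so the sketch assumes it names the payload format and returns it as an opaque string; the fourth code is not described here and is treated as unhandled.

```python
import json
from typing import Optional


def describe_payload(format_code: int, key: Optional[bytes]) -> str:
    """Map the two-bit format code (and, for code 2, the key) to a format name."""
    if format_code == 0:
        return "text"
    if format_code == 1:
        return "avro-binary"
    if format_code == 2:
        # Assumption: the "t" property of the JSON key names the payload format.
        if key is None:
            raise ValueError("format code 2 requires a JSON key")
        obj = json.loads(key.decode("utf-8"))
        if not isinstance(obj, dict) or "t" not in obj:
            raise ValueError('format code 2 requires a JSON object key with a "t" property')
        return str(obj["t"])
    # The remaining code is not described in this thread, so it is left unhandled.
    raise ValueError(f"unhandled format code: {format_code}")
```

Note that code 2 ties format signalling to the key field, so a topic that already uses the key for partitioning or compaction could not use it without changing its keys.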