Re: [DISCUSS] KIP-82 - Add Record Headers

Roger Hoover Tue, 08 Nov 2016 17:41:37 -0800

Sorry, I didn't read the KIP carefully enough and thought there was more of a
difference between 5a and 5c.  I now see that in 5a, the headers section is
already defined as a sub-protocol that (I assume) does not need to be
parsed at the broker.


The main difference, as pointed out, is whether the broker would ever
want/need to act on non-standard headers.  In any case, if there's a good
default serializer, there should be minimal need for custom serializers.
It's a fairly minor point.

The main decision on headers, it seems, is how to handle namespacing them.
With strings, each header can carry its own namespace via a prefix
(e.g. grpc-* for gRPC on top of HTTP2).

For ints, the KIP suggests a global registry for headers and Magnus
proposed a decentralized pluggable mapping.  In either case, I'd suggest a
tweak to the model.  I think it would help to identify a header by
(namespaceId, fieldId) similar to the Protobuf model.  If there's appetite
for a global registry, organizations would only have to register a single
namespace and from then on add headers at will.  For the pluggable mapping,
users would only have to provide a mapping for each namespace they use,
rather than a mapping for each specific header.  It's up to the
apps/plugins within a namespace to maintain their own list of field ids (as
with a protobuf definition).
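
To make the (namespaceId, fieldId) idea concrete, here is a minimal sketch.
All names in it (HeaderKey, NAMESPACES, AUDIT_FIELDS, TRACING_FIELDS, the
"acme-*" namespaces) are hypothetical and are not part of the KIP or of any
Kafka client API. It only illustrates that a user maps each namespace once,
while field ids stay private to the namespace owner.

```python
from typing import NamedTuple


class HeaderKey(NamedTuple):
    """A header identified by (namespace_id, field_id), as in the proposal."""
    namespace_id: int
    field_id: int


# Hypothetical pluggable mapping: one entry per namespace, not per header.
NAMESPACES = {"acme-audit": 1, "acme-tracing": 2}

# Each namespace owner maintains its own field ids (like a protobuf definition).
AUDIT_FIELDS = {"origin_host": 1, "sent_at": 2}
TRACING_FIELDS = {"trace_id": 1}


def header_key(namespace, fields, field_name):
    """Resolve a (namespace, field name) pair to its numeric key."""
    return HeaderKey(NAMESPACES[namespace], fields[field_name])


# Field id 1 is used in both namespaces, yet the keys do not collide.
headers = {
    header_key("acme-audit", AUDIT_FIELDS, "origin_host"): b"web-17",
    header_key("acme-tracing", TRACING_FIELDS, "trace_id"): b"abc123",
}
```

With a global registry, only the entries in NAMESPACES would need to be
registered; with the pluggable mapping, they would come from configuration.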

My main concern (similar to Nacho) is to see a common client API that
allows per message headers.  I'll let others debate the performance
tradeoffs of strings vs. ints.

Roger

On Tue, Nov 8, 2016 at 10:54 AM, Nacho Solis <nso...@linkedin.com.invalid>
wrote:

> From Roger's description:
>
> 5a- separate metadata field, built in serialization
> 5c- separate metadata field, custom serialization
> 5b- custom serialization (inside V)
> 5d- built in serialization (inside V)
>
> I added 5d for completeness.
>
> From this perspective I would choose
>
> 5a > 5c > 5d > 5b
>
> In all of these cases, I would like to make sure that the broker
> (eventually) has the ability to deal with the headers.
>
> - Custom serialization
> I'm not in favor of custom serialization for the whole metadata/header
> block. This is because I think that we will have multiple headers or
> metadata blobs by different teams (either internal or external), with
> different goals and different requirements. They will work independently of
> each other. They will not try to coordinate a common format. The audit team
> and the security team and the performance team and the monitoring team and
> the application might not have the same needs.  The one need that they
> share is sending data along with the message.   To put all of their data
> together, we will need a common header system.
>
> Obviously the kafka team (the team in charge of running the kafka system,
> say, the linkedin-kafka-team) can write a wrapper or a custom serializer
> that somehow provides a set of functions for all these teams to work
> together.  So technically we're not limited. However, if we want to share
> our plugins with some other team, the acme-kafka-team then we would have to
> have compatible serializers.  This is doable but not an easy task.
>
> From my point of view we have 2 options:
> A- We use a built in serializer for the headers. Each plugin/module can
> then serialize their internal data however they want, but the set format
> itself is common.  This would allow us to work together. Plugins are shared
> and evolve from collective efforts.
> B1- We use a custom serializer for headers.  We have balkanization of
> headers and no cooperation
> B2- We use a custom serializer for headers.  One such serializer becomes
> popular, effectively providing a wrapper to the open source clients that
> provides header support. Various companies/entities start using this and
> form a community around this. Plugins are shared and evolve from collective
> efforts.
>
> I believe that given B2 offers collective power, it will overtake B1.
> Effectively, we would reach the same situation as A, but it will take a
> little more time and will make the code more difficult to manage.
>
>
> Isn't this the same reason Connect is inside Apache Kafka?  And now there
> are a set of Kafka Connectors (https://www.confluent.io/product/connectors/)
> that take advantage of the fact that Connect defines a common framework.
>
>
> To be clear, I think my main goal would be for Apache Kafka to offer a
> Client API to add and remove headers per message.  If we can offer this as
> a standard (in other words, part of Apache Kafka open source), then we have
> achieved 80% of the work. The community will benefit as a whole.  Whether
> this is done via a container inside V, whether it's done natively in the
> protocol, whether we offer a way to override the serializer, and whether
> the broker can understand the headers are all secondary (though I obviously
> have opinions about those: no, yes, no, yes).
>
> If we don't want to include this in Apache open source, then the people
> that want it will have to write their own (if they haven't done so
> already). With time, they will end up writing a common wrapper, the common
> wrapper will get shared plugins, people will start using the shared
> plugins, the wrapper will become more popular than the regular clients and
> eventually there will be a fork or a merge back.
>
> Yes, it is possible that not everybody wants headers, but so far we haven't
> met many (any?) of those people. At most we've seen people that are happy
> with their own implementation or hack around the issue.  I'm pretty certain
> that if they had had headers to start with they wouldn't be in the situation
> they are today.
>
> Even if the current people don't want to change from their current system;
> the new people will probably use it. LinkedIn for certain would use it.
>
> Make Kafka great again!  [1]
>
>
> Nacho
>
> [1] to be clear that's a joke... it's election day in the US
>
>
> On Tue, Nov 8, 2016 at 9:48 AM, radai <radai.rosenbl...@gmail.com> wrote:
>
> > both 5a and 5c would involve a wire format change, so any arguments about
> > needing an upgrade path bumping protocol version etc apply equally to both.
> > so the "cost" (in terms of impact of a wire format change) is the same.
> >
> > 5c, to me, means doing all the work (more exactly incurring all the cost)
> > but getting very few of the benefits. a universal, agreed-upon structure
> > for headers (specifically their keys) is, in my opinion, a basic
> > requirement to reap the full benefits of headers - an active ecosystem of
> > composable, re-usable, 3rd-party extensions to kafka.
> >
> > as for what exactly those keys are (int vs string) - since using ints is
> > such a giant sticking point and given kafka usually operates with batching
> > and compression and does not achieve high-enough iops for it to make a
> > noticeable difference in CPU consumption I'm willing to go with string keys
> > just to get that out of the way.
> >
> > On Mon, Nov 7, 2016 at 11:51 PM, Michael Pearce <michael.pea...@ig.com>
> > wrote:
> >
> > > +1 on this slimmer version of our proposal
> > >
> > > I def think we can reduce the Id space from the proposed int32 (4 bytes)
> > > down to int16 (2 bytes). It saves space, and as these are headers we
> > > wouldn't expect the number of headers being used concurrently to be that
> > > high.
> > >
> > > I wonder whether we should still make the value byte array length int32,
> > > though, as this is the standard max array length in Java. That said, it
> > > is a header, and I guess limiting the size is sensible and would work for
> > > all the use cases we have in mind, so I'm happy with limiting this.
> > >
> > > Do people generally concur on Magnus's slimmer version? Anyone see any
> > > issues if we moved from int32 to int16?
> > >
> > > Re configurable ids per plugin over a global registry: that would also
> > > work for us.  As such, if this has better consensus than the proposed
> > > global registry, I'd be happy to change that.
> > >
> > > I was already sold on ints over strings for keys ;)
> > >
> > > Cheers
> > > Mike
> > >
> > > ________________________________________
> > > From: Magnus Edenhill <mag...@edenhill.se>
> > > Sent: Monday, November 7, 2016 10:10:21 PM
> > > To: dev@kafka.apache.org
> > > Subject: Re: [DISCUSS] KIP-82 - Add Record Headers
> > >
> > > Hi,
> > >
> > > I'm +1 for adding generic message headers, but I do share the concerns
> > > previously aired on this thread and during the KIP meeting.
> > >
> > > So let me propose a slimmer alternative that does not require any sort of
> > > global header registry, does not affect broker performance or operations,
> > > and adds as little overhead as possible.
> > >
> > >
> > > Message
> > > ------------
> > > The protocol Message type is extended with a Headers array consisting of
> > > Tags, where a Tag is defined as:
> > >    int16 Id
> > >    int16 Len              // binary_data length
> > >    binary_data[Len]  // opaque binary data
> > >
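
For readers following along, the Tag layout quoted above (int16 Id, int16
Len, then Len bytes of opaque data) could be encoded roughly as follows.
This is only a sketch: the function names are made up, big-endian byte order
is assumed because that is what the Kafka protocol uses for its integer
types, and the framing of the surrounding Headers array is not specified in
the message, so tags are simply concatenated here.

```python
import struct

# int16 Id followed by int16 Len, big-endian (network byte order).
TAG_HEADER = struct.Struct(">hh")


def encode_tag(tag_id, data):
    """Serialize one Tag: Id, Len, then the opaque binary_data."""
    if len(data) > 0x7FFF:
        raise ValueError("binary_data does not fit an int16 length")
    return TAG_HEADER.pack(tag_id, len(data)) + data


def decode_tags(buf):
    """Parse back-to-back Tags into a list of (Id, binary_data) pairs."""
    tags = []
    offset = 0
    while offset < len(buf):
        tag_id, length = TAG_HEADER.unpack_from(buf, offset)
        offset += TAG_HEADER.size
        if length < 0 or offset + length > len(buf):
            raise ValueError("truncated or malformed tag")
        tags.append((tag_id, buf[offset:offset + length]))
        offset += length
    return tags
```

Each tag costs 4 bytes of overhead on top of its data, which is the space
argument being made for int ids.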
> > >
> > > Ids
> > > ---
> > > The Id space is not centrally managed, so whenever an application needs to
> > > add headers, or use an eco-system plugin that does, its Id allocation will
> > > need to be manually configured.
> > > This moves the allocation concern from the global space down to
> > > organization level and avoids the risk of id conflicts.
> > > Example pseudo-config for some app:
> > >     sometrackerplugin.tag.sourcev3.id=1000
> > >     dbthing.tag.tablename.id=1001
> > >     myschemareg.tag.schemaname.id=1002
> > >     myschemareg.tag.schemaversion.id=1003
> > >
> > >
> > > Each header-writing or header-reading plugin must provide means (typically
> > > through configuration) to specify the tag for each header it uses. Defaults
> > > should be avoided.
> > > A consumer silently ignores tags it does not have a mapping for (since the
> > > binary_data can't be parsed without knowing what it is).
> > >
> > > Id range 0..999 is reserved for future use by the broker and must not be
> > > used by plugins.
> > >
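
The configuration rules quoted above could be enforced on the client side
along these lines. This is a sketch under assumptions: the function names
are hypothetical, the config is taken to be a plain dict of strings, and the
"<plugin>.tag.<name>.id" key shape is copied from the pseudo-config in the
message rather than from any real client setting.

```python
RESERVED_MAX = 999  # ids 0..999 are reserved for the broker in the proposal


def load_tag_mapping(config):
    """Build {tag id: header name} from '<plugin>.tag.<name>.id' entries."""
    mapping = {}
    for key, value in config.items():
        if ".tag." not in key or not key.endswith(".id"):
            continue  # not a tag id setting
        tag_id = int(value)
        if tag_id <= RESERVED_MAX:
            raise ValueError(f"{key}: id {tag_id} is in the reserved range")
        if tag_id in mapping:
            raise ValueError(f"{key}: id {tag_id} already used by {mapping[tag_id]}")
        mapping[tag_id] = key[: -len(".id")]
    return mapping


def known_headers(tags, mapping):
    """Keep only tags that have a mapping; unknown ids are silently ignored."""
    return {mapping[tag_id]: data for tag_id, data in tags if tag_id in mapping}


config = {
    "sometrackerplugin.tag.sourcev3.id": "1000",
    "dbthing.tag.tablename.id": "1001",
}
mapping = load_tag_mapping(config)
# Tag 4242 has no mapping, so the consumer drops it without error.
visible = known_headers([(1000, b"v3"), (4242, b"??")], mapping)
```

Rejecting duplicate ids at load time is an addition of this sketch; the
message itself only says that allocation is configured manually.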
> > >
> > >
> > > Broker
> > > ---------
> > > The broker does not process the tags (other than the standard protocol
> > > syntax verification), it simply stores and forwards them as opaque data.
> > >
> > > Standard message translation (removal of Headers) kicks in for older
> > > clients.
> > >
> > >
> > > Why not string ids?
> > > -------------------------
> > > String ids might seem like a good idea, but they:
> > >  * do not really solve uniqueness
> > >  * consume a lot of space (2 byte string length + string, per header) to
> > >    be meaningful
> > >  * don't really say anything about how to parse the tag's data, so they are
> > >    in effect useless on their own.
> > >
> > >
> > > Regards,
> > > Magnus
> > >
> > >
> > >
> > >
> > > 2016-11-07 18:32 GMT+01:00 Michael Pearce <michael.pea...@ig.com>:
> > >
> > > > Hi Roger,
> > > >
> > > > Thanks for the support.
> > > >
> > > > I think the key thing is to have a common key space to make an
> > > > ecosystem; there does have to be some level of contract for people to
> > > > play nicely.
> > > >
> > > > Having map<String, byte[]> or, as currently proposed in the KIP, a
> > > > numerical key space of map<int, byte[]> is a level of contract that
> > > > most people would expect.
> > > >
> > > > I think the example someone else made in a previous comment, linking to
> > > > the AWS blog and the implemented API, where originally they didn't have
> > > > a header space but now they do, and where keys are uniform but the
> > > > value can be string, int, anything, is a good example.
> > > >
> > > > Having a custom MetadataSerializer is something we had played with, but
> > > > we discounted the idea: if you want everyone to work the same way in
> > > > the ecosystem, having this also customizable makes it a bit harder.
> > > > Think about making the whole message record custom serializable; this
> > > > would be fairly tricky (though not impossible) to make work nicely.
> > > > Having the value customizable is, we thought, a reasonable tradeoff
> > > > here between flexibility and a contract of interaction between
> > > > different parties.
> > > >
> > > > Is there a particular case or benefit of having serialization
> > > > customizable that you have in mind?
> > > >
> > > > That said, it is obviously something that could be implemented if there
> > > > is a need. If we did go down this avenue, I think a default serializer
> > > > implementation should exist so that, per the 80:20 rule, people can
> > > > just have the broker and clients get default behavior.
> > > >
> > > > Cheers
> > > > Mike
> > > >
> > > > On 11/6/16, 5:25 PM, "radai" <radai.rosenbl...@gmail.com> wrote:
> > > >
> > > >     making header _key_ serialization configurable potentially
> > > >     undermines the broad usefulness of the feature (any point along the
> > > >     path must be able to read the header keys. the values may be
> > > >     whatever and require more intimate knowledge of the code that
> > > >     produced specific headers, but keys should be universally readable).
> > > >
> > > >     it would also make it hard to write really portable plugins - say i
> > > >     wrote a large message splitter/combiner - if i rely on key
> > > >     "largeMessage" and values of the form "1/20" someone who uses
> > > >     (contrived example) Map<Byte[], Double> wouldnt be able to re-use my
> > > >     code.
> > > >
> > > >     not the end of the world within an organization, but problematic if
> > > >     you want to enable an ecosystem
> > > >
> > > >     On Thu, Nov 3, 2016 at 2:04 PM, Roger Hoover <roger.hoo...@gmail.com>
> > > >     wrote:
> > > >
> > > >     > As others have laid out, I see strong reasons for a common
> > > >     > message metadata structure for the Kafka ecosystem.  In
> > > >     > particular, I've seen that even within a single organization,
> > > >     > infrastructure teams often own the message metadata while
> > > >     > application teams own the application-level data format.
> > > >     > Allowing metadata and content to have different structure and
> > > >     > evolve separately is very helpful for this.  Also, I think
> > > >     > there's a lot of value to having a common metadata structure
> > > >     > shared across the Kafka ecosystem so that tools which leverage
> > > >     > metadata can more easily be shared across organizations and
> > > >     > integrated together.
> > > >     >
> > > >     > The question is, where does the metadata structure belong?
> > > >     > Here's my take:
> > > >     >
> > > >     > We change the Kafka wire and on-disk format from a (key, value)
> > > >     > model to a (key, metadata, value) model where all three are byte
> > > >     > arrays from the broker's point of view.  The primary reason for
> > > >     > this is that it provides a backward compatible migration path
> > > >     > forward.  Producers can start populating metadata fields before
> > > >     > all consumers understand the metadata structure.  For people who
> > > >     > already have custom envelope structures, they can populate their
> > > >     > existing structure and the new structure for a while as they
> > > >     > make the transition.
> > > >     >
> > > >     > We could stop there and let the clients plug in a KeySerializer,
> > > >     > MetadataSerializer, and ValueSerializer but I think it would
> > > >     > also be useful to have a default MetadataSerializer that
> > > >     > implements a key-value model similar to AMQP or HTTP headers.  Or
> > > >     > we could go even further and prescribe a Map<String, byte[]> or
> > > >     > Map<String, String> data model for headers in the clients (while
> > > >     > still allowing custom serialization of the header data model).
> > > >     >
> > > >     > I think this would address Radai's concerns:
> > > >     > 1. All client code would not need to be updated to know about
> > > >     >    the container.
> > > >     > 2. Middleware friendly clients would have a standard header data
> > > >     >    model to work with.
> > > >     > 3. KIP is required both b/c of broker changes and because of
> > > >     >    client API changes.
> > > >     >
> > > >     > Cheers,
> > > >     >
> > > >     > Roger
> > > >     >
> > > >     >
> > > >     > On Wed, Nov 2, 2016 at 4:38 PM, radai <radai.rosenbl...@gmail.com>
> > > >     > wrote:
> > > >     >
> > > >     > > my biggest issues with a "standard" wrapper format:
> > > >     > >
> > > >     > > 1. _ALL_ client _CODE_ (as opposed to kafka lib version) must
> > > >     > >    be updated to know about the container, because any old
> > > >     > >    naive code trying to directly deserialize its own payload
> > > >     > >    would keel over and die (it needs to know to deserialize a
> > > >     > >    container, and then dig in there for its payload).
> > > >     > > 2. in order to write middleware-friendly clients that utilize
> > > >     > >    such a container one would basically have to write their own
> > > >     > >    producer/consumer API on top of the open source kafka one.
> > > >     > > 3. if you were going to go with a wrapper format you really
> > > >     > >    dont need to bother with a kip (just open source your own
> > > >     > >    client stack from #2 above so others could stop re-inventing
> > > >     > >    it)
> > > >     > >
> > > >     > > On Wed, Nov 2, 2016 at 4:25 PM, James Cheng
> > > >     > > <wushuja...@gmail.com> wrote:
> > > >     > >
> > > >     > > > How exactly would this work? Or maybe that's out of scope for
> > > >     > > > this email.
> > > >     > >
> > > >     >
> > > >
> > > >
> > > > The information contained in this email is strictly confidential and
> > > > for the use of the addressee only, unless otherwise indicated. If you
> > > > are not the intended recipient, please do not read, copy, use or
> > > > disclose to others this message or any attachment. Please also notify
> > > > the sender by replying to this email or by telephone (+44(020 7896
> > > > 0011) and then delete the email and any copies of it. Opinions,
> > > > conclusion (etc) that do not relate to the official business of this
> > > > company shall be understood as neither given nor endorsed by it. IG is
> > > > a trading name of IG Markets Limited (a company registered in England
> > > > and Wales, company number 04008957) and IG Index Limited (a company
> > > > registered in England and Wales, company number 01190902). Registered
> > > > address at Cannon Bridge House, 25 Dowgate Hill, London EC4R 2YA. Both
> > > > IG Markets Limited (register number 195355) and IG Index Limited
> > > > (register number 114059) are authorised and regulated by the Financial
> > > > Conduct Authority.
> > > >
> > >
> >
>
>
>
> --
> Nacho (Ignacio) Solis
> Kafka
> nso...@linkedin.com
>
