Re: [DISCUSS] KIP-28 - Add a transform client for data processing

Guozhang Wang Fri, 24 Jul 2015 10:17:45 -0700

Hi Yi,

Inlined.


On Fri, Jul 24, 2015 at 12:57 AM, Yi Pan <nickpa...@gmail.com> wrote:

> Hi, Guozhang,
>
> Thanks for starting this. I took a quick look and had the following
> thoughts to share:
>
> - In the proposed KafkaProcessor API, there is no interface like Collector
> that allows users to send messages to. Why is that? Is the idea to
> initialize the producer once and re-use it in the processor? And if there
> are many KStreamThreads in the process, are there going to be many
> instances of KafkaProducer although all outputs are sending to the same
> Kafka cluster?
>
>
The API mentioned in the wiki page is neither final nor comprehensive, it
is just for illustrating its usage for replacing the producer + consumer
APIs. I will try to add the first draft of the full APIs to the wiki page
later, together with a prototype implementation of such APIs. And yes, the
send() function should definitely be supported.

Regarding whether we should have multiple producer / consumer instances or
not within a single processor instance, as mentioned in the wiki it is not
fully decided yet and again, I would like to add such content to the KIP
proposal once we have a prototype illustrating the design so that people
can have a better idea and discuss over it.


> - Won’t it be simpler if the process() API just takes in the ConsumerRecord
> as the input instead of a tuple of (topic, key, value)?
>
>
I think I agree.


> - Also, the input only indicates the topic of a message. What if the stream
> task needs to consume and produce messages from/to multiple Kafka clusters?
> To support that case, there should be a system/cluster name in both input
> and output as well.
>
>
Yeah that is a good question, I agree that in the final state the processor
context should include such record metadata as well as partition-id,
offset, etc.


> - How are the output messages handled? There does not seem to have an
> interface that allows user to send an output messages to multiple output
> Kafka clusters.
>
>
See above.


> - It seems the proposed model also assumes one thread per processor. What
> becomes thread-local and what are shared among processors? Is the proposed
> model targeting to have the consumers/producers become thread-local
> instances within each KafkaProcessor? What’s the cost associated with this
> model?
>
>
We do not assume one thread per processor, I think the name of
KStreamThread would be a bit misleading here: we can definitely spawn more
threads within this "main thread" of the process if we decided to do so in
the system design.


> - One more important issue: how do we plug-in client-side partition
> management logic? Considering about the use case where the stream task
> needs to consume from multiple Kafka clusters, I am not even sure that we
> can rely on Kafka broker to maintain the consumer group membership? Maybe
> we still can get the per cluster consumer group membership and partitions.
> However, in this case, we truly need a client-side plugin partition
> management logic to determine how to assign partitions in different Kafka
> clusters to consumers (i.e. consumers for cluster1.topic1.p1 and
> cluster2.topic2.p1 has to be assigned together to one KafkaProcessor for
> processing). Based on the full information about (group members, all topic
> partitions) in all Kafka clusters with input topics, there should be two
> levels of partition management policies: a) how to group all topic
> partitions in all Kafka clusters to processor groups (i.e. the same concept
> as Task group in Samza); b) how to assign the processor groups to group
> members. Note if a processor group includes topic partitions from more than
> one Kafka clusters, it has to be assigned to the common group members in
> all relevant Kafka clusters. This can not be done just by the brokers in a
> single Kafka cluster.
>
>
Yes, the current broker-side partition assignment is not flexible enough
and we are considering to change the protocol so that to allow clients to
assign partitions themselves; and depending on the system design we may
handle the two-level assignment differently, for example, if we use one
consumer for each incoming Kafka cluster within the process instance, and
have multiple processor threads that reads the data from the shared
consumer, then we besides consumer-group partition assignment we also need
to determine the allocation of partitions into the processor threads.


> - It seems that the intention of this KIP is also trying to put SQL/DSL
> libraries into Kafka. Why is it? Shouldn't Kafka be more focused on hiding
> system-level integration details and leave it open for any additional
> modules outside the Kafka core to enrich the functionality that are
> user-facing?
>
>
I was not trying to push SQL / DSL into Kafka, but I do want to bring this
up for discussion as "what features should be included in this client
library".

Just a few quick cents. Thanks a lot!
>
> -Yi
>
> On Fri, Jul 24, 2015 at 12:12 AM, Neha Narkhede <n...@confluent.io> wrote:
>
> > Ewen:
> >
> > * I think trivial filtering and aggregation on a single stream usually
> work
> > > fine with this model.
> >
> >
> > The way I see this, the process() API is an abstraction for
> > message-at-a-time computations. In the future, you could imagine
> providing
> > a simple DSL layer on top of the process() API that provides a set of
> APIs
> > for stream processing operations on sets of messages like joins, windows
> > and various aggregations.
> >
> > * Spark (and presumably
> > > spark streaming) is supposed to get a big win by handling shuffles such
> > > that the data just stays in cache and never actually hits disk, or at
> > least
> > > hits disk in the background. Will we take a hit because we always write
> > to
> > > Kafka?
> >
> >
> > The goal isn't so much about forcing materialization of intermediate
> > results into Kafka but designing the API to integrate with Kafka to allow
> > such materialization, wherever that might be required. The downside with
> > other stream processing frameworks is that they have weak integration
> with
> > Kafka where interaction with Kafka is only at the endpoints of processing
> > (first input, final output). Any intermediate operations that might
> benefit
> > from persisting intermediate results into Kafka are forced to be broken
> up
> > into 2 separate topologies/plans/stages of processing that lead to more
> > jobs. The implication is that now the set of stream processing operations
> > that should really have lived in one job per application is now split up
> > across several piecemeal jobs that need to be monitored, managed and
> > operated separately. The APIs should still allows in-memory storage of
> > intermediate results where they make sense.
> >
> > Jiangjie,
> >
> > I just took a quick look at the KIP, is it very similar to mirror maker
> > > with message handler?
> >
> >
> > Not really. I wouldn't say it is similar, but mirror maker is a special
> > instance of using copycat with Kafka source, sink + optionally the
> > process() API. I can imagine replacing the MirrorMaker, in the due course
> > of time, with copycat + process().
> >
> > Thanks,
> > Neha
> >
> > On Thu, Jul 23, 2015 at 11:32 PM, Jiangjie Qin <j...@linkedin.com.invalid
> >
> > wrote:
> >
> > > Hey Guozhang,
> > >
> > > I just took a quick look at the KIP, is it very similar to mirror maker
> > > with message handler?
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava <
> > e...@confluent.io
> > > >
> > > wrote:
> > >
> > > > Just some notes on the KIP doc itself:
> > > >
> > > > * It'd be useful to clarify at what point the plain consumer + custom
> > > code
> > > > + producer breaks down. I think trivial filtering and aggregation on
> a
> > > > single stream usually work fine with this model. Anything where you
> > need
> > > > more complex joins, windowing, etc. are where it breaks down. I think
> > > most
> > > > interesting applications require that functionality, but it's helpful
> > to
> > > > make this really clear in the motivation -- right now, Kafka only
> > > provides
> > > > the lowest level plumbing for stream processing applications, so most
> > > > interesting apps require very heavyweight frameworks.
> > > > * I think the feature comparison of plain producer/consumer, stream
> > > > processing frameworks, and this new library is a good start, but we
> > might
> > > > want something more thorough and structured, like a feature matrix.
> > Right
> > > > now it's hard to figure out exactly how they relate to each other.
> > > > * I'd personally push the library vs. framework story very strongly
> --
> > > the
> > > > total buy-in and weak integration story of stream processing
> frameworks
> > > is
> > > > a big downside and makes a library a really compelling (and currently
> > > > unavailable, as far as I am aware) alternative.
> > > > * Comment about in-memory storage of other frameworks is interesting
> --
> > > it
> > > > is specific to the framework, but is supposed to also give
> performance
> > > > benefits. The high-level functional processing interface would allow
> > for
> > > > combining multiple operations when there's no shuffle, but when there
> > is
> > > a
> > > > shuffle, we'll always be writing to Kafka, right? Spark (and
> presumably
> > > > spark streaming) is supposed to get a big win by handling shuffles
> such
> > > > that the data just stays in cache and never actually hits disk, or at
> > > least
> > > > hits disk in the background. Will we take a hit because we always
> write
> > > to
> > > > Kafka?
> > > > * I really struggled with the structure of the KIP template with
> > Copycat
> > > > because the flow doesn't work well for proposals like this. They
> aren't
> > > as
> > > > concrete changes as the KIP template was designed for. I'd completely
> > > > ignore that template in favor of optimizing for clarity if I were
> you.
> > > >
> > > > -Ewen
> > > >
> > > > On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang <wangg...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I just posted KIP-28: Add a transform client for data processing
> > > > > <
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
> > > > > >
> > > > > .
> > > > >
> > > > > The wiki page does not yet have the full design / implementation
> > > details,
> > > > > and this email is to kick-off the conversation on whether we should
> > add
> > > > > this new client with the described motivations, and if yes what
> > > features
> > > > /
> > > > > functionalities should be included.
> > > > >
> > > > > Looking forward to your feedback!
> > > > >
> > > > > -- Guozhang
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks,
> > > > Ewen
> > > >
> > >
> >
> >
> >
> > --
> > Thanks,
> > Neha
> >
>



-- 
-- Guozhang

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

Reply via email to