Re: [DISCUSS] KIP-28 - Add a transform client for data processing

Yasuhiro Matsuda Fri, 24 Jul 2015 10:10:46 -0700

The goal of this KIP is to provide a lightweight/embeddable streaming
framework, and allows Kafka users to start using stream processing easily. DSL
is not covered in this KIP. But, DSL is a very attractive option to have.


> In the proposed KafkaProcessor API, there is no interface like Collector
that allows users to send messages to. Why is that?

It is not stated in the KIP, but Context provides a simple interface to a
producer.

>Won’t it be simpler if the process() API just takes in the ConsumerRecord
as the input instead of a tuple of (topic, key, value)?

If Kafka implement a simple DSL something like the one in Spark, I think
ConsumerRecord may not be the most convenient thing for the framework or
the most intuitive thing for users. I don't think we need "topic" in the
arguments. Think about a most simple application, all it needs is a key and
a value. That makes the API simpler. If the application needs to access
more info (topic, offset), Context should provide them.


On Fri, Jul 24, 2015 at 12:57 AM, Yi Pan <nickpa...@gmail.com> wrote:

> Hi, Guozhang,
>
> Thanks for starting this. I took a quick look and had the following
> thoughts to share:
>
> - In the proposed KafkaProcessor API, there is no interface like Collector
> that allows users to send messages to. Why is that? Is the idea to
> initialize the producer once and re-use it in the processor? And if there
> are many KStreamThreads in the process, are there going to be many
> instances of KafkaProducer although all outputs are sending to the same
> Kafka cluster?
>
> - Won’t it be simpler if the process() API just takes in the ConsumerRecord
> as the input instead of a tuple of (topic, key, value)?
>
> - Also, the input only indicates the topic of a message. What if the stream
> task needs to consume and produce messages from/to multiple Kafka clusters?
> To support that case, there should be a system/cluster name in both input
> and output as well.
>
> - How are the output messages handled? There does not seem to have an
> interface that allows user to send an output messages to multiple output
> Kafka clusters.
>
> - It seems the proposed model also assumes one thread per processor. What
> becomes thread-local and what are shared among processors? Is the proposed
> model targeting to have the consumers/producers become thread-local
> instances within each KafkaProcessor? What’s the cost associated with this
> model?
>
> - One more important issue: how do we plug-in client-side partition
> management logic? Considering about the use case where the stream task
> needs to consume from multiple Kafka clusters, I am not even sure that we
> can rely on Kafka broker to maintain the consumer group membership? Maybe
> we still can get the per cluster consumer group membership and partitions.
> However, in this case, we truly need a client-side plugin partition
> management logic to determine how to assign partitions in different Kafka
> clusters to consumers (i.e. consumers for cluster1.topic1.p1 and
> cluster2.topic2.p1 has to be assigned together to one KafkaProcessor for
> processing). Based on the full information about (group members, all topic
> partitions) in all Kafka clusters with input topics, there should be two
> levels of partition management policies: a) how to group all topic
> partitions in all Kafka clusters to processor groups (i.e. the same concept
> as Task group in Samza); b) how to assign the processor groups to group
> members. Note if a processor group includes topic partitions from more than
> one Kafka clusters, it has to be assigned to the common group members in
> all relevant Kafka clusters. This can not be done just by the brokers in a
> single Kafka cluster.
>
> - It seems that the intention of this KIP is also trying to put SQL/DSL
> libraries into Kafka. Why is it? Shouldn't Kafka be more focused on hiding
> system-level integration details and leave it open for any additional
> modules outside the Kafka core to enrich the functionality that are
> user-facing?
>
> Just a few quick cents. Thanks a lot!
>
> -Yi
>
> On Fri, Jul 24, 2015 at 12:12 AM, Neha Narkhede <n...@confluent.io> wrote:
>
> > Ewen:
> >
> > * I think trivial filtering and aggregation on a single stream usually
> work
> > > fine with this model.
> >
> >
> > The way I see this, the process() API is an abstraction for
> > message-at-a-time computations. In the future, you could imagine
> providing
> > a simple DSL layer on top of the process() API that provides a set of
> APIs
> > for stream processing operations on sets of messages like joins, windows
> > and various aggregations.
> >
> > * Spark (and presumably
> > > spark streaming) is supposed to get a big win by handling shuffles such
> > > that the data just stays in cache and never actually hits disk, or at
> > least
> > > hits disk in the background. Will we take a hit because we always write
> > to
> > > Kafka?
> >
> >
> > The goal isn't so much about forcing materialization of intermediate
> > results into Kafka but designing the API to integrate with Kafka to allow
> > such materialization, wherever that might be required. The downside with
> > other stream processing frameworks is that they have weak integration
> with
> > Kafka where interaction with Kafka is only at the endpoints of processing
> > (first input, final output). Any intermediate operations that might
> benefit
> > from persisting intermediate results into Kafka are forced to be broken
> up
> > into 2 separate topologies/plans/stages of processing that lead to more
> > jobs. The implication is that now the set of stream processing operations
> > that should really have lived in one job per application is now split up
> > across several piecemeal jobs that need to be monitored, managed and
> > operated separately. The APIs should still allows in-memory storage of
> > intermediate results where they make sense.
> >
> > Jiangjie,
> >
> > I just took a quick look at the KIP, is it very similar to mirror maker
> > > with message handler?
> >
> >
> > Not really. I wouldn't say it is similar, but mirror maker is a special
> > instance of using copycat with Kafka source, sink + optionally the
> > process() API. I can imagine replacing the MirrorMaker, in the due course
> > of time, with copycat + process().
> >
> > Thanks,
> > Neha
> >
> > On Thu, Jul 23, 2015 at 11:32 PM, Jiangjie Qin <j...@linkedin.com.invalid
> >
> > wrote:
> >
> > > Hey Guozhang,
> > >
> > > I just took a quick look at the KIP, is it very similar to mirror maker
> > > with message handler?
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava <
> > e...@confluent.io
> > > >
> > > wrote:
> > >
> > > > Just some notes on the KIP doc itself:
> > > >
> > > > * It'd be useful to clarify at what point the plain consumer + custom
> > > code
> > > > + producer breaks down. I think trivial filtering and aggregation on
> a
> > > > single stream usually work fine with this model. Anything where you
> > need
> > > > more complex joins, windowing, etc. are where it breaks down. I think
> > > most
> > > > interesting applications require that functionality, but it's helpful
> > to
> > > > make this really clear in the motivation -- right now, Kafka only
> > > provides
> > > > the lowest level plumbing for stream processing applications, so most
> > > > interesting apps require very heavyweight frameworks.
> > > > * I think the feature comparison of plain producer/consumer, stream
> > > > processing frameworks, and this new library is a good start, but we
> > might
> > > > want something more thorough and structured, like a feature matrix.
> > Right
> > > > now it's hard to figure out exactly how they relate to each other.
> > > > * I'd personally push the library vs. framework story very strongly
> --
> > > the
> > > > total buy-in and weak integration story of stream processing
> frameworks
> > > is
> > > > a big downside and makes a library a really compelling (and currently
> > > > unavailable, as far as I am aware) alternative.
> > > > * Comment about in-memory storage of other frameworks is interesting
> --
> > > it
> > > > is specific to the framework, but is supposed to also give
> performance
> > > > benefits. The high-level functional processing interface would allow
> > for
> > > > combining multiple operations when there's no shuffle, but when there
> > is
> > > a
> > > > shuffle, we'll always be writing to Kafka, right? Spark (and
> presumably
> > > > spark streaming) is supposed to get a big win by handling shuffles
> such
> > > > that the data just stays in cache and never actually hits disk, or at
> > > least
> > > > hits disk in the background. Will we take a hit because we always
> write
> > > to
> > > > Kafka?
> > > > * I really struggled with the structure of the KIP template with
> > Copycat
> > > > because the flow doesn't work well for proposals like this. They
> aren't
> > > as
> > > > concrete changes as the KIP template was designed for. I'd completely
> > > > ignore that template in favor of optimizing for clarity if I were
> you.
> > > >
> > > > -Ewen
> > > >
> > > > On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang <wangg...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I just posted KIP-28: Add a transform client for data processing
> > > > > <
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
> > > > > >
> > > > > .
> > > > >
> > > > > The wiki page does not yet have the full design / implementation
> > > details,
> > > > > and this email is to kick-off the conversation on whether we should
> > add
> > > > > this new client with the described motivations, and if yes what
> > > features
> > > > /
> > > > > functionalities should be included.
> > > > >
> > > > > Looking forward to your feedback!
> > > > >
> > > > > -- Guozhang
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks,
> > > > Ewen
> > > >
> > >
> >
> >
> >
> > --
> > Thanks,
> > Neha
> >
>

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

Reply via email to