Re: [DISCUSS] KIP-28 - Add a transform client for data processing

Neha Narkhede Fri, 24 Jul 2015 00:13:30 -0700

Ewen:

> * I think trivial filtering and aggregation on a single stream usually work
> fine with this model.

The way I see this, the process() API is an abstraction for
message-at-a-time computations. In the future, you could imagine a simple
DSL layer on top of the process() API that offers stream processing
operations over sets of messages, such as joins, windows and various
aggregations.
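
As a rough illustration of that layering (a minimal sketch only: the
Processor, FilterProcessor and CollectingProcessor names below are
hypothetical and are not taken from the KIP), a message-at-a-time interface
with a small filter operator built on top of it might look like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical message-at-a-time interface; an assumption for illustration,
// not the API proposed in KIP-28.
interface Processor<K, V> {
    void process(K key, V value);
}

// Hypothetical DSL-style operator layered on process():
// forwards a message downstream only if its value matches the predicate.
final class FilterProcessor<K, V> implements Processor<K, V> {
    private final Predicate<V> predicate;
    private final Processor<K, V> downstream;

    FilterProcessor(Predicate<V> predicate, Processor<K, V> downstream) {
        this.predicate = predicate;
        this.downstream = downstream;
    }

    @Override
    public void process(K key, V value) {
        if (predicate.test(value)) {
            downstream.process(key, value);
        }
    }
}

// Stand-in for a sink: collects every value it receives into a list.
final class CollectingProcessor<K, V> implements Processor<K, V> {
    final List<V> values = new ArrayList<>();

    @Override
    public void process(K key, V value) {
        values.add(value);
    }
}

public class ProcessSketch {
    // Feeds the values 1..6 through an "even values only" filter, one message at a time.
    static List<Integer> run() {
        CollectingProcessor<String, Integer> sink = new CollectingProcessor<>();
        Processor<String, Integer> evens = new FilterProcessor<>(v -> v % 2 == 0, sink);
        for (int i = 1; i <= 6; i++) {
            evens.process("k" + i, i);
        }
        return sink.values;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Joins, windows and aggregations would need state in addition to this kind of
chaining, which the sketch deliberately leaves out.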

> * Spark (and presumably
> spark streaming) is supposed to get a big win by handling shuffles such
> that the data just stays in cache and never actually hits disk, or at least
> hits disk in the background. Will we take a hit because we always write to
> Kafka?

The goal isn't to force materialization of intermediate results into Kafka,
but to design the API to integrate with Kafka so that such materialization
is possible wherever it is required. The downside of other stream
processing frameworks is their weak integration with Kafka: they interact
with Kafka only at the endpoints of processing (first input, final output).
Any intermediate operation that might benefit from persisting its results
into Kafka has to be broken up into two separate topologies/plans/stages of
processing, which leads to more jobs. The implication is that the set of
stream processing operations that should really have lived in one job per
application is split up across several piecemeal jobs that need to be
monitored, managed and operated separately. The API should still allow
in-memory storage of intermediate results where that makes sense.

Jiangjie,

> I just took a quick look at the KIP, is it very similar to mirror maker
> with message handler?

Not really. I wouldn't say it is similar, but mirror maker is a special
instance of using copycat with a Kafka source and sink, plus optionally the
process() API. I can imagine replacing MirrorMaker, in due course, with
copycat + process().

Thanks,
Neha

On Thu, Jul 23, 2015 at 11:32 PM, Jiangjie Qin <j...@linkedin.com.invalid>
wrote:

> Hey Guozhang,
>
> I just took a quick look at the KIP, is it very similar to mirror maker
> with message handler?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava <e...@confluent.io>
> wrote:
>
> > Just some notes on the KIP doc itself:
> >
> > * It'd be useful to clarify at what point the plain consumer +
> > custom code + producer breaks down. I think trivial filtering and
> > aggregation on a single stream usually work fine with this model.
> > Anything where you need more complex joins, windowing, etc. are where it
> > breaks down. I think most interesting applications require that
> > functionality, but it's helpful to make this really clear in the
> > motivation -- right now, Kafka only provides the lowest level plumbing
> > for stream processing applications, so most interesting apps require
> > very heavyweight frameworks.
> > * I think the feature comparison of plain producer/consumer, stream
> > processing frameworks, and this new library is a good start, but we might
> > want something more thorough and structured, like a feature matrix. Right
> > now it's hard to figure out exactly how they relate to each other.
> > * I'd personally push the library vs. framework story very strongly --
> > the total buy-in and weak integration story of stream processing
> > frameworks is a big downside and makes a library a really compelling (and
> > currently unavailable, as far as I am aware) alternative.
> > * Comment about in-memory storage of other frameworks is interesting --
> > it is specific to the framework, but is supposed to also give performance
> > benefits. The high-level functional processing interface would allow for
> > combining multiple operations when there's no shuffle, but when there is
> > a shuffle, we'll always be writing to Kafka, right? Spark (and presumably
> > spark streaming) is supposed to get a big win by handling shuffles such
> > that the data just stays in cache and never actually hits disk, or at
> > least hits disk in the background. Will we take a hit because we always
> > write to Kafka?
> > * I really struggled with the structure of the KIP template with Copycat
> > because the flow doesn't work well for proposals like this. They aren't
> > as concrete changes as the KIP template was designed for. I'd completely
> > ignore that template in favor of optimizing for clarity if I were you.
> >
> > -Ewen
> >
> > On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang <wangg...@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > I just posted KIP-28: Add a transform client for data processing
> > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing>.
> > >
> > > The wiki page does not yet have the full design / implementation
> > > details, and this email is to kick-off the conversation on whether we
> > > should add this new client with the described motivations, and if yes
> > > what features / functionalities should be included.
> > >
> > > Looking forward to your feedback!
> > >
> > > -- Guozhang
> > >
> >
> >
> >
> > --
> > Thanks,
> > Ewen
> >
>

-- 
Thanks,
Neha
