Re: [DISCUSS] KIP-28 - Add a transform client for data processing

Yi Pan Mon, 27 Jul 2015 23:01:02 -0700

Hi, Aditya,

{quote}
- The KIP states that cmd line tools will be provided to deploy as a
separate service. Is the proposed scope limited to providing a library
which makes it possible to build stream-processing-as-a-service, or does
it include providing such a service within Kafka itself?
{quote}


There has already been a long discussion on the Samza mailing list,
which partly resulted in this KIP proposal. The basic conclusion was that
this KIP is to build a stream processor library that can be used either as
a library or as a standalone process. The standalone process may be used as
a deployment method for stream processors in a cluster environment, but
that would be outside the scope of this KIP.

-Yi

On Mon, Jul 27, 2015 at 10:46 PM, Aditya Auradkar <
[email protected]> wrote:

> +1 on comparison with existing solutions. On a high level, it seems nice to
> have a transform library inside Kafka; a lot of the building blocks are
> already there to build a stream processing framework. However, the details
> are tricky to get right. I think this discussion will get a lot more
> interesting when we have something concrete to look at. I'm +1 for the
> general idea.
> How far away are we from having a prototype patch to play with?
>
> Couple of observations:
> - Since the input source for each processor is always Kafka, you get basic
> client-side partition management out of the box if it uses the high-level
> consumer.
> - The KIP states that cmd line tools will be provided to deploy as a
> separate service. Is the proposed scope limited to providing a library
> which makes it possible to build stream-processing-as-a-service, or does
> it include providing such a service within Kafka itself?
>
> Aditya
>
> On Mon, Jul 27, 2015 at 8:20 PM, Gwen Shapira <[email protected]>
> wrote:
>
> > Hi,
> >
> > Since we will be discussing KIP-28 in the call tomorrow, can you
> > update the KIP with the feature comparison with existing solutions?
> > I admit that I do not see a need for a single-event producer-consumer
> > pair (AKA Flume Interceptor). I've seen tons of people implement such
> > apps in the last year, and it seemed easy. Now, perhaps we were doing
> > it all wrong... but I'd like to know how :)
> >
> > If we are talking about a bigger story (i.e. DSL, real
> > stream-processing, etc.), that's a different discussion. I've seen a
> > bunch of misconceptions about Spark Streaming in this discussion, and I
> > have some thoughts in that regard, but I'd rather not go into that if
> > that's outside the scope of this KIP.
> >
> > Gwen
> >
> >
> > On Fri, Jul 24, 2015 at 9:48 AM, Guozhang Wang <[email protected]> wrote:
> > > Hi Ewen,
> > >
> > > Replies inlined.
> > >
> > > On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava <[email protected]>
> > > wrote:
> > >
> > >> Just some notes on the KIP doc itself:
> > >>
> > >> * It'd be useful to clarify at what point the plain consumer + custom
> > >> code + producer breaks down. I think trivial filtering and aggregation
> > >> on a single stream usually work fine with this model. Anything where you
> > >> need more complex joins, windowing, etc. is where it breaks down. I think
> > >> most interesting applications require that functionality, but it's
> > >> helpful to make this really clear in the motivation -- right now, Kafka
> > >> only provides the lowest level plumbing for stream processing
> > >> applications, so most interesting apps require very heavyweight
> > >> frameworks.
> > >>
> > >
> > > I think for users to efficiently express complex logic like joins,
> > > windowing, etc., a higher-level programming interface beyond the
> > > process() interface would definitely be better, but that does not
> > > necessarily require a "heavyweight" framework, which usually includes
> > > more than just the high-level functional programming model. I would
> > > argue that an alternative solution should be provided for users who
> > > want a high-level programming interface but not a heavyweight stream
> > > processing framework: one that includes the processor library plus
> > > another DSL layer on top of it.
> > >
> > >
> > >
> > >> * I think the feature comparison of plain producer/consumer, stream
> > >> processing frameworks, and this new library is a good start, but we
> > >> might want something more thorough and structured, like a feature
> > >> matrix. Right now it's hard to figure out exactly how they relate to
> > >> each other.
> > >>
> > >
> > > Cool, I can do that.
> > >
> > >
> > >> * I'd personally push the library vs. framework story very strongly --
> > >> the total buy-in and weak integration story of stream processing
> > >> frameworks is a big downside and makes a library a really compelling
> > >> (and currently unavailable, as far as I am aware) alternative.
> > >>
> > >
> > > Are you suggesting there is still some content missing from the wiki
> > > page about the motivations for adding the proposed library?
> > >
> > >
> > >> * The comment about in-memory storage of other frameworks is interesting
> > >> -- it is specific to the framework, but is supposed to also give
> > >> performance benefits. The high-level functional processing interface
> > >> would allow for combining multiple operations when there's no shuffle,
> > >> but when there is a shuffle, we'll always be writing to Kafka, right?
> > >> Spark (and presumably Spark Streaming) is supposed to get a big win by
> > >> handling shuffles such that the data just stays in cache and never
> > >> actually hits disk, or at least hits disk in the background. Will we
> > >> take a hit because we always write to Kafka?
> > >>
> > >
> > > I agree with Neha's comments here. One more point I want to make is
> > > that materializing to Kafka is not necessarily much worse than keeping
> > > data in memory if the downstream consumption is caught up, such that
> > > most of the reads will be hitting the file cache. I remember Samza has
> > > illustrated that under such scenarios its throughput is actually quite
> > > comparable to Spark Streaming / Storm.
> > >
> > >
> > >> * I really struggled with the structure of the KIP template with
> > >> Copycat because the flow doesn't work well for proposals like this.
> > >> They aren't changes as concrete as the ones the KIP template was
> > >> designed for. I'd completely ignore that template in favor of
> > >> optimizing for clarity if I were you.
> > >>
> > >> -Ewen
> > >>
> > >> On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang <[email protected]> wrote:
> > >>
> > >> > Hi all,
> > >> >
> > >> > I just posted KIP-28: Add a transform client for data processing
> > >> > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing>.
> > >> >
> > >> > The wiki page does not yet have the full design / implementation
> > >> > details, and this email is to kick off the conversation on whether we
> > >> > should add this new client with the described motivations, and if
> > >> > yes, what features / functionalities should be included.
> > >> >
> > >> > Looking forward to your feedback!
> > >> >
> > >> > -- Guozhang
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Thanks,
> > >> Ewen
> > >>
> > >
> > >
> > >
> > > --
> > > -- Guozhang
> >
>
