Re: [DISCUSS] KIP-28 - Add a transform client for data processing

Guozhang Wang Mon, 27 Jul 2015 23:18:02 -0700

Hi Adi,

Just to clarify, the cmdline tool would be used, as stated in the wiki
page, to run the client library "as a process", which is still far away
from a "service". It is just like what we have for kafka-console-producer,
kafka-console-consumer, kafka-mirror-maker, etc. today.


Guozhang

On Mon, Jul 27, 2015 at 10:46 PM, Aditya Auradkar <
[email protected]> wrote:

> +1 on comparison with existing solutions. At a high level, it seems nice to
> have a transform library inside Kafka; a lot of the building blocks are
> already there to build a stream processing framework. However, the details
> are tricky to get right. I think this discussion will get a lot more
> interesting when we have something concrete to look at. I'm +1 for the
> general idea.
> How far away are we from having a prototype patch to play with?
>
> Couple of observations:
> - Since the input source for each processor is always Kafka, you get basic
> client-side partition management out of the box if it uses the high-level
> consumer.
> - The KIP states that cmd line tools will be provided to deploy as a
> separate service. Is the proposed scope limited to providing a library that
> makes it possible to build stream-processing-as-a-service, or does it
> include providing such a service within Kafka itself?
>
> Aditya
>
> On Mon, Jul 27, 2015 at 8:20 PM, Gwen Shapira <[email protected]>
> wrote:
>
> > Hi,
> >
> > Since we will be discussing KIP-28 in the call tomorrow, can you
> > update the KIP with the feature comparison with existing solutions?
> > I admit that I do not see a need for a single-event-producer-consumer
> > pair (AKA Flume Interceptor). I've seen tons of people implement such
> > apps in the last year, and it seemed easy. Now, perhaps we were doing
> > it all wrong... but I'd like to know how :)
> >
> > If we are talking about a bigger story (i.e. DSL, real
> > stream-processing, etc.), that's a different discussion. I've seen a
> > bunch of misconceptions about Spark Streaming in this discussion, and I
> > have some thoughts in that regard, but I'd rather not go into that if
> > that's outside the scope of this KIP.
> >
> > Gwen
> >
> >
> > On Fri, Jul 24, 2015 at 9:48 AM, Guozhang Wang <[email protected]>
> wrote:
> > > Hi Ewen,
> > >
> > > Replies inlined.
> > >
> > > On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava <
> > [email protected]>
> > > wrote:
> > >
> > >> Just some notes on the KIP doc itself:
> > >>
> > >> * It'd be useful to clarify at what point the plain consumer + custom
> > code
> > >> + producer breaks down. I think trivial filtering and aggregation on a
> > >> single stream usually work fine with this model. Anything where you
> need
> > >> more complex joins, windowing, etc. is where it breaks down. I think
> > most
> > >> interesting applications require that functionality, but it's helpful
> to
> > >> make this really clear in the motivation -- right now, Kafka only
> > provides
> > >> the lowest level plumbing for stream processing applications, so most
> > >> interesting apps require very heavyweight frameworks.
> > >>
> > >
> > > I think for users to efficiently express complex logic like joins,
> > > windowing, etc., a higher-level programming interface beyond the
> > > process() interface would definitely be better, but that does not
> > > necessarily require a "heavyweight" framework, which usually includes
> > > more than just the high-level functional programming model. I would
> > > argue that it would be better to provide an alternative solution,
> > > consisting of the processor library plus another DSL layer on top of
> > > it, for users who want a high-level programming interface but not a
> > > heavyweight stream processing framework.
> > >
> > >
> > >
> > >> * I think the feature comparison of plain producer/consumer, stream
> > >> processing frameworks, and this new library is a good start, but we
> > might
> > >> want something more thorough and structured, like a feature matrix.
> > Right
> > >> now it's hard to figure out exactly how they relate to each other.
> > >>
> > >
> > > Cool, I can do that.
> > >
> > >
> > >> * I'd personally push the library vs. framework story very strongly --
> > the
> > >> total buy-in and weak integration story of stream processing
> frameworks
> > is
> > >> a big downside and makes a library a really compelling (and currently
> > >> unavailable, as far as I am aware) alternative.
> > >>
> > >
> > > Are you suggesting there is still some content missing about the
> > > motivations for adding the proposed library in the wiki page?
> > >
> > >
> > >> * Comment about in-memory storage of other frameworks is interesting
> --
> > it
> > >> is specific to the framework, but is supposed to also give performance
> > >> benefits. The high-level functional processing interface would allow
> for
> > >> combining multiple operations when there's no shuffle, but when there
> > is a
> > >> shuffle, we'll always be writing to Kafka, right? Spark (and
> presumably
> > >> Spark Streaming) is supposed to get a big win by handling shuffles
> such
> > >> that the data just stays in cache and never actually hits disk, or at
> > least
> > >> hits disk in the background. Will we take a hit because we always
> write
> > to
> > >> Kafka?
> > >>
> > >
> > > I agree with Neha's comments here. One more point I want to make is
> > > that materializing to Kafka is not necessarily much worse than keeping
> > > data in memory if the downstream consumption is caught up, such that
> > > most of the reads will hit the file cache. I remember Samza has
> > > illustrated that under such scenarios its throughput is actually quite
> > > comparable to Spark Streaming / Storm.
> > >
> > >
> > >> * I really struggled with the structure of the KIP template with
> > >> Copycat because the flow doesn't work well for proposals like this.
> > >> They aren't changes as concrete as the ones the KIP template was
> > >> designed for. I'd completely ignore that template in favor of
> > >> optimizing for clarity if I were you.
> > >>
> > >> -Ewen
> > >>
> > >> On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang <[email protected]>
> > wrote:
> > >>
> > >> > Hi all,
> > >> >
> > >> > I just posted KIP-28: Add a transform client for data processing
> > >> > <
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
> > >> > >
> > >> > .
> > >> >
> > >> > The wiki page does not yet have the full design / implementation
> > details,
> > >> > and this email is to kick off the conversation on whether we should
> > add
> > >> > this new client with the described motivations, and if yes what
> > features
> > >> /
> > >> > functionalities should be included.
> > >> >
> > >> > Looking forward to your feedback!
> > >> >
> > >> > -- Guozhang
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Thanks,
> > >> Ewen
> > >>
> > >
> > >
> > >
> > > --
> > > -- Guozhang
> >
>



-- 
-- Guozhang
