Hi Adi,

Just to clarify, the cmdline tool would be used, as stated in the wiki page, to run the client library "as a process", which is still far away from a "service". It is just like what we have today for kafka-console-producer, kafka-console-consumer, kafka-mirror-maker, etc.

Guozhang

On Mon, Jul 27, 2015 at 10:46 PM, Aditya Auradkar <aaurad...@linkedin.com.invalid> wrote:

> +1 on comparison with existing solutions. On a high level, it seems nice to
> have a transform library inside Kafka; a lot of the building blocks are
> already there to build a stream processing framework. However, the details
> are tricky to get right. I think this discussion will get a lot more
> interesting when we have something concrete to look at. I'm +1 for the
> general idea.
> How far away are we from having a prototype patch to play with?
>
> Couple of observations:
> - Since the input source for each processor is always Kafka, you get basic
> client-side partition management out of the box if it uses the high-level
> consumer.
> - The KIP states that cmd line tools will be provided to deploy as a
> separate service. Is the proposed scope limited to providing a library
> which makes it possible to build stream-processing-as-a-service, or to
> provide such a service within Kafka itself?
>
> Aditya
>
> On Mon, Jul 27, 2015 at 8:20 PM, Gwen Shapira <gshap...@cloudera.com> wrote:
>
> > Hi,
> >
> > Since we will be discussing KIP-28 in the call tomorrow, can you
> > update the KIP with the feature-comparison with existing solutions?
> > I admit that I do not see a need for single-event-producer-consumer
> > pair (AKA Flume Interceptor). I've seen tons of people implement such
> > apps in the last year, and it seemed easy. Now, perhaps we were doing
> > it all wrong... but I'd like to know how :)
> >
> > If we are talking about a bigger story (i.e. DSL, real
> > stream-processing, etc), that's a different discussion. I've seen a
> > bunch of misconceptions about SparkStreaming in this discussion, and I
> > have some thoughts in that regard, but I'd rather not go into that if
> > that's outside the scope of this KIP.
> >
> > Gwen
> >
> > On Fri, Jul 24, 2015 at 9:48 AM, Guozhang Wang <wangg...@gmail.com> wrote:
> >
> > > Hi Ewen,
> > >
> > > Replies inlined.
> > >
> > > On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
> > >
> > >> Just some notes on the KIP doc itself:
> > >>
> > >> * It'd be useful to clarify at what point the plain consumer + custom code
> > >> + producer breaks down. I think trivial filtering and aggregation on a
> > >> single stream usually work fine with this model. Anything where you need
> > >> more complex joins, windowing, etc. is where it breaks down. I think most
> > >> interesting applications require that functionality, but it's helpful to
> > >> make this really clear in the motivation -- right now, Kafka only provides
> > >> the lowest level plumbing for stream processing applications, so most
> > >> interesting apps require very heavyweight frameworks.
> > >>
> > > I think for users to efficiently express complex logic like joins,
> > > windowing, etc, a higher-level programming interface beyond the process()
> > > interface would definitely be better, but that does not necessarily require
> > > a "heavyweight" framework, which usually includes more than just the
> > > high-level functional programming model. I would argue that an alternative
> > > solution, one that includes the processor library plus another DSL layer on
> > > top of it, would be better for users who want some high-level programming
> > > interface but not a heavyweight stream processing framework.
> > >
> > >> * I think the feature comparison of plain producer/consumer, stream
> > >> processing frameworks, and this new library is a good start, but we might
> > >> want something more thorough and structured, like a feature matrix. Right
> > >> now it's hard to figure out exactly how they relate to each other.
> > >>
> > > Cool, I can do that.
> > >
> > >> * I'd personally push the library vs. framework story very strongly -- the
> > >> total buy-in and weak integration story of stream processing frameworks is
> > >> a big downside and makes a library a really compelling (and currently
> > >> unavailable, as far as I am aware) alternative.
> > >>
> > > Are you suggesting there is still some content missing about the
> > > motivations of adding the proposed library in the wiki page?
> > >
> > >> * Comment about in-memory storage of other frameworks is interesting -- it
> > >> is specific to the framework, but is supposed to also give performance
> > >> benefits. The high-level functional processing interface would allow for
> > >> combining multiple operations when there's no shuffle, but when there is a
> > >> shuffle, we'll always be writing to Kafka, right? Spark (and presumably
> > >> Spark Streaming) is supposed to get a big win by handling shuffles such
> > >> that the data just stays in cache and never actually hits disk, or at least
> > >> hits disk in the background. Will we take a hit because we always write to
> > >> Kafka?
> > >>
> > > I agree with Neha's comments here. One more point I want to make is that
> > > materializing to Kafka is not necessarily much worse than keeping data in
> > > memory if the downstream consumption is caught up such that most of the
> > > reads will be hitting file cache. I remember Samza has illustrated that
> > > under such scenarios its throughput is actually quite comparable to Spark
> > > Streaming / Storm.
> > >
> > >> * I really struggled with the structure of the KIP template with Copycat
> > >> because the flow doesn't work well for proposals like this. They aren't as
> > >> concrete changes as the KIP template was designed for. I'd completely
> > >> ignore that template in favor of optimizing for clarity if I were you.
> > >>
> > >> -Ewen
> > >>
> > >> On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> > >>
> > >> > Hi all,
> > >> >
> > >> > I just posted KIP-28: Add a transform client for data processing
> > >> > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing>.
> > >> >
> > >> > The wiki page does not yet have the full design / implementation details,
> > >> > and this email is to kick-off the conversation on whether we should add
> > >> > this new client with the described motivations, and if yes what features /
> > >> > functionalities should be included.
> > >> >
> > >> > Looking forward to your feedback!
> > >> >
> > >> > -- Guozhang
> > >>
> > >> --
> > >> Thanks,
> > >> Ewen
> > >
> > > --
> > > -- Guozhang

--
-- Guozhang