The goal of this KIP is to provide a lightweight/embeddable streaming framework, and allows Kafka users to start using stream processing easily. DSL is not covered in this KIP. But, DSL is a very attractive option to have.
> In the proposed KafkaProcessor API, there is no interface like Collector that allows users to send messages to. Why is that? It is not stated in the KIP, but Context provides a simple interface to a producer. >Won’t it be simpler if the process() API just takes in the ConsumerRecord as the input instead of a tuple of (topic, key, value)? If Kafka implement a simple DSL something like the one in Spark, I think ConsumerRecord may not be the most convenient thing for the framework or the most intuitive thing for users. I don't think we need "topic" in the arguments. Think about a most simple application, all it needs is a key and a value. That makes the API simpler. If the application needs to access more info (topic, offset), Context should provide them. On Fri, Jul 24, 2015 at 12:57 AM, Yi Pan <nickpa...@gmail.com> wrote: > Hi, Guozhang, > > Thanks for starting this. I took a quick look and had the following > thoughts to share: > > - In the proposed KafkaProcessor API, there is no interface like Collector > that allows users to send messages to. Why is that? Is the idea to > initialize the producer once and re-use it in the processor? And if there > are many KStreamThreads in the process, are there going to be many > instances of KafkaProducer although all outputs are sending to the same > Kafka cluster? > > - Won’t it be simpler if the process() API just takes in the ConsumerRecord > as the input instead of a tuple of (topic, key, value)? > > - Also, the input only indicates the topic of a message. What if the stream > task needs to consume and produce messages from/to multiple Kafka clusters? > To support that case, there should be a system/cluster name in both input > and output as well. > > - How are the output messages handled? There does not seem to have an > interface that allows user to send an output messages to multiple output > Kafka clusters. > > - It seems the proposed model also assumes one thread per processor. What > becomes thread-local and what are shared among processors? Is the proposed > model targeting to have the consumers/producers become thread-local > instances within each KafkaProcessor? What’s the cost associated with this > model? > > - One more important issue: how do we plug-in client-side partition > management logic? Considering about the use case where the stream task > needs to consume from multiple Kafka clusters, I am not even sure that we > can rely on Kafka broker to maintain the consumer group membership? Maybe > we still can get the per cluster consumer group membership and partitions. > However, in this case, we truly need a client-side plugin partition > management logic to determine how to assign partitions in different Kafka > clusters to consumers (i.e. consumers for cluster1.topic1.p1 and > cluster2.topic2.p1 has to be assigned together to one KafkaProcessor for > processing). Based on the full information about (group members, all topic > partitions) in all Kafka clusters with input topics, there should be two > levels of partition management policies: a) how to group all topic > partitions in all Kafka clusters to processor groups (i.e. the same concept > as Task group in Samza); b) how to assign the processor groups to group > members. Note if a processor group includes topic partitions from more than > one Kafka clusters, it has to be assigned to the common group members in > all relevant Kafka clusters. This can not be done just by the brokers in a > single Kafka cluster. > > - It seems that the intention of this KIP is also trying to put SQL/DSL > libraries into Kafka. Why is it? Shouldn't Kafka be more focused on hiding > system-level integration details and leave it open for any additional > modules outside the Kafka core to enrich the functionality that are > user-facing? > > Just a few quick cents. Thanks a lot! > > -Yi > > On Fri, Jul 24, 2015 at 12:12 AM, Neha Narkhede <n...@confluent.io> wrote: > > > Ewen: > > > > * I think trivial filtering and aggregation on a single stream usually > work > > > fine with this model. > > > > > > The way I see this, the process() API is an abstraction for > > message-at-a-time computations. In the future, you could imagine > providing > > a simple DSL layer on top of the process() API that provides a set of > APIs > > for stream processing operations on sets of messages like joins, windows > > and various aggregations. > > > > * Spark (and presumably > > > spark streaming) is supposed to get a big win by handling shuffles such > > > that the data just stays in cache and never actually hits disk, or at > > least > > > hits disk in the background. Will we take a hit because we always write > > to > > > Kafka? > > > > > > The goal isn't so much about forcing materialization of intermediate > > results into Kafka but designing the API to integrate with Kafka to allow > > such materialization, wherever that might be required. The downside with > > other stream processing frameworks is that they have weak integration > with > > Kafka where interaction with Kafka is only at the endpoints of processing > > (first input, final output). Any intermediate operations that might > benefit > > from persisting intermediate results into Kafka are forced to be broken > up > > into 2 separate topologies/plans/stages of processing that lead to more > > jobs. The implication is that now the set of stream processing operations > > that should really have lived in one job per application is now split up > > across several piecemeal jobs that need to be monitored, managed and > > operated separately. The APIs should still allows in-memory storage of > > intermediate results where they make sense. > > > > Jiangjie, > > > > I just took a quick look at the KIP, is it very similar to mirror maker > > > with message handler? > > > > > > Not really. I wouldn't say it is similar, but mirror maker is a special > > instance of using copycat with Kafka source, sink + optionally the > > process() API. I can imagine replacing the MirrorMaker, in the due course > > of time, with copycat + process(). > > > > Thanks, > > Neha > > > > On Thu, Jul 23, 2015 at 11:32 PM, Jiangjie Qin <j...@linkedin.com.invalid > > > > wrote: > > > > > Hey Guozhang, > > > > > > I just took a quick look at the KIP, is it very similar to mirror maker > > > with message handler? > > > > > > Thanks, > > > > > > Jiangjie (Becket) Qin > > > > > > On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava < > > e...@confluent.io > > > > > > > wrote: > > > > > > > Just some notes on the KIP doc itself: > > > > > > > > * It'd be useful to clarify at what point the plain consumer + custom > > > code > > > > + producer breaks down. I think trivial filtering and aggregation on > a > > > > single stream usually work fine with this model. Anything where you > > need > > > > more complex joins, windowing, etc. are where it breaks down. I think > > > most > > > > interesting applications require that functionality, but it's helpful > > to > > > > make this really clear in the motivation -- right now, Kafka only > > > provides > > > > the lowest level plumbing for stream processing applications, so most > > > > interesting apps require very heavyweight frameworks. > > > > * I think the feature comparison of plain producer/consumer, stream > > > > processing frameworks, and this new library is a good start, but we > > might > > > > want something more thorough and structured, like a feature matrix. > > Right > > > > now it's hard to figure out exactly how they relate to each other. > > > > * I'd personally push the library vs. framework story very strongly > -- > > > the > > > > total buy-in and weak integration story of stream processing > frameworks > > > is > > > > a big downside and makes a library a really compelling (and currently > > > > unavailable, as far as I am aware) alternative. > > > > * Comment about in-memory storage of other frameworks is interesting > -- > > > it > > > > is specific to the framework, but is supposed to also give > performance > > > > benefits. The high-level functional processing interface would allow > > for > > > > combining multiple operations when there's no shuffle, but when there > > is > > > a > > > > shuffle, we'll always be writing to Kafka, right? Spark (and > presumably > > > > spark streaming) is supposed to get a big win by handling shuffles > such > > > > that the data just stays in cache and never actually hits disk, or at > > > least > > > > hits disk in the background. Will we take a hit because we always > write > > > to > > > > Kafka? > > > > * I really struggled with the structure of the KIP template with > > Copycat > > > > because the flow doesn't work well for proposals like this. They > aren't > > > as > > > > concrete changes as the KIP template was designed for. I'd completely > > > > ignore that template in favor of optimizing for clarity if I were > you. > > > > > > > > -Ewen > > > > > > > > On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang <wangg...@gmail.com> > > > wrote: > > > > > > > > > Hi all, > > > > > > > > > > I just posted KIP-28: Add a transform client for data processing > > > > > < > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing > > > > > > > > > > > . > > > > > > > > > > The wiki page does not yet have the full design / implementation > > > details, > > > > > and this email is to kick-off the conversation on whether we should > > add > > > > > this new client with the described motivations, and if yes what > > > features > > > > / > > > > > functionalities should be included. > > > > > > > > > > Looking forward to your feedback! > > > > > > > > > > -- Guozhang > > > > > > > > > > > > > > > > > > > > > -- > > > > Thanks, > > > > Ewen > > > > > > > > > > > > > > > -- > > Thanks, > > Neha > > >