Hi Dibyendu, This is really awesome. I am still yet to go through the code to understand the details, but I want to do it really soon. In particular, I want to understand the improvements, over the existing Kafka receiver.
And its fantastic to see such contributions from the community. :) TD On Tue, Aug 5, 2014 at 8:38 AM, Dibyendu Bhattacharya < dibyendu.bhattach...@gmail.com> wrote: > Hi > > This fault tolerant aspect already taken care in the Kafka-Spark Consumer > code , like if Leader of a partition changes etc.. in ZkCoordinator.java. > Basically it does a refresh of PartitionManagers every X seconds to make > sure Partition details is correct and consumer don't fail. > > Dib > > > On Tue, Aug 5, 2014 at 8:01 PM, Shao, Saisai <saisai.s...@intel.com> > wrote: > > > Hi, > > > > I think this is an awesome feature for Spark Streaming Kafka interface to > > offer user the controllability of partition offset, so user can have more > > applications based on this. > > > > What I concern is that if we want to do offset management, fault tolerant > > related control and others, we have to take the role as current > > ZookeeperConsumerConnect did, that would be a big field we should take > care > > of, for example when node is failed, how to pass current partition to > > another consumer and some others. I’m not sure what is your thought? > > > > Thanks > > Jerry > > > > From: Dibyendu Bhattacharya [mailto:dibyendu.bhattach...@gmail.com] > > Sent: Tuesday, August 05, 2014 5:15 PM > > To: Jonathan Hodges; dev@spark.apache.org > > Cc: user > > Subject: Re: Low Level Kafka Consumer for Spark > > > > Thanks Jonathan, > > > > Yes, till non-ZK based offset management is available in Kafka, I need to > > maintain the offset in ZK. And yes, both cases explicit commit is > > necessary. I modified the Low Level Kafka Spark Consumer little bit to > have > > Receiver spawns threads for every partition of the topic and perform the > > 'store' operation in multiple threads. It would be good if the > > receiver.store methods are made thread safe..which is not now presently . > > > > Waiting for TD's comment on this Kafka Spark Low Level consumer. > > > > > > Regards, > > Dibyendu > > > > > > On Tue, Aug 5, 2014 at 5:32 AM, Jonathan Hodges <hodg...@gmail.com > <mailto: > > hodg...@gmail.com>> wrote: > > Hi Yan, > > > > That is a good suggestion. I believe non-Zookeeper offset management > will > > be a feature in the upcoming Kafka 0.8.2 release tentatively scheduled > for > > September. > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/Inbuilt+Consumer+Offset+Management > > > > That should make this fairly easy to implement, but it will still require > > explicit offset commits to avoid data loss which is different than the > > current KafkaUtils implementation. > > > > Jonathan > > > > > > > > > > On Mon, Aug 4, 2014 at 4:51 PM, Yan Fang <yanfang...@gmail.com<mailto: > > yanfang...@gmail.com>> wrote: > > Another suggestion that may help is that, you can consider use Kafka to > > store the latest offset instead of Zookeeper. There are at least two > > benefits: 1) lower the workload of ZK 2) support replay from certain > > offset. This is how Samza<http://samza.incubator.apache.org/> deals with > > the Kafka offset, the doc is here< > > > http://samza.incubator.apache.org/learn/documentation/0.7.0/container/checkpointing.html > > > > . Thank you. > > > > Cheers, > > > > Fang, Yan > > yanfang...@gmail.com<mailto:yanfang...@gmail.com> > > +1 (206) 849-4108<tel:%2B1%20%28206%29%20849-4108> > > > > On Sun, Aug 3, 2014 at 8:59 PM, Patrick Wendell <pwend...@gmail.com > > <mailto:pwend...@gmail.com>> wrote: > > I'll let TD chime on on this one, but I'm guessing this would be a > welcome > > addition. It's great to see community effort on adding new > > streams/receivers, adding a Java API for receivers was something we did > > specifically to allow this :) > > > > - Patrick > > > > On Sat, Aug 2, 2014 at 10:09 AM, Dibyendu Bhattacharya < > > dibyendu.bhattach...@gmail.com<mailto:dibyendu.bhattach...@gmail.com>> > > wrote: > > Hi, > > > > I have implemented a Low Level Kafka Consumer for Spark Streaming using > > Kafka Simple Consumer API. This API will give better control over the > Kafka > > offset management and recovery from failures. As the present Spark > > KafkaUtils uses HighLevel Kafka Consumer API, I wanted to have a better > > control over the offset management which is not possible in Kafka > HighLevel > > consumer. > > > > This Project is available in below Repo : > > > > https://github.com/dibbhatt/kafka-spark-consumer > > > > > > I have implemented a Custom Receiver consumer.kafka.client.KafkaReceiver. > > The KafkaReceiver uses low level Kafka Consumer API (implemented in > > consumer.kafka packages) to fetch messages from Kafka and 'store' it in > > Spark. > > > > The logic will detect number of partitions for a topic and spawn that > many > > threads (Individual instances of Consumers). Kafka Consumer uses > Zookeeper > > for storing the latest offset for individual partitions, which will help > to > > recover in case of failure. The Kafka Consumer logic is tolerant to ZK > > Failures, Kafka Leader of Partition changes, Kafka broker failures, > > recovery from offset errors and other fail-over aspects. > > > > The consumer.kafka.client.Consumer is the sample Consumer which uses this > > Kafka Receivers to generate DStreams from Kafka and apply a Output > > operation for every messages of the RDD. > > > > We are planning to use this Kafka Spark Consumer to perform Near Real > Time > > Indexing of Kafka Messages to target Search Cluster and also Near Real > Time > > Aggregation using target NoSQL storage. > > > > Kindly let me know your view. Also if this looks good, can I contribute > to > > Spark Streaming project. > > > > Regards, > > Dibyendu > > > > > > > > > > >