Re: Low Level Kafka Consumer for Spark

Tathagata Das Wed, 06 Aug 2014 16:18:08 -0700

Hi Dibyendu,

This is really awesome. I am still yet to go through the code to understand
the details, but I want to do it really soon. In particular, I want to
understand the improvements, over the existing Kafka receiver.


And its fantastic to see such contributions from the community. :)

TD

On Tue, Aug 5, 2014 at 8:38 AM, Dibyendu Bhattacharya <
dibyendu.bhattach...@gmail.com> wrote:

> Hi
>
> This fault tolerant aspect already taken care in the Kafka-Spark Consumer
> code , like if Leader of a partition changes etc.. in ZkCoordinator.java.
> Basically it does a refresh of PartitionManagers every X seconds to make
> sure Partition details is correct and consumer don't fail.
>
> Dib
>
>
> On Tue, Aug 5, 2014 at 8:01 PM, Shao, Saisai <saisai.s...@intel.com>
> wrote:
>
> > Hi,
> >
> > I think this is an awesome feature for Spark Streaming Kafka interface to
> > offer user the controllability of partition offset, so user can have more
> > applications based on this.
> >
> > What I concern is that if we want to do offset management, fault tolerant
> > related control and others, we have to take the role as current
> > ZookeeperConsumerConnect did, that would be a big field we should take
> care
> > of, for example when node is failed, how to pass current partition to
> > another consumer and some others. I’m not sure what is your thought?
> >
> > Thanks
> > Jerry
> >
> > From: Dibyendu Bhattacharya [mailto:dibyendu.bhattach...@gmail.com]
> > Sent: Tuesday, August 05, 2014 5:15 PM
> > To: Jonathan Hodges; dev@spark.apache.org
> > Cc: user
> > Subject: Re: Low Level Kafka Consumer for Spark
> >
> > Thanks Jonathan,
> >
> > Yes, till non-ZK based offset management is available in Kafka, I need to
> > maintain the offset in ZK. And yes, both cases explicit commit is
> > necessary. I modified the Low Level Kafka Spark Consumer little bit to
> have
> > Receiver spawns threads for every partition of the topic and perform the
> > 'store' operation in multiple threads. It would be good if the
> > receiver.store methods are made thread safe..which is not now presently .
> >
> > Waiting for TD's comment on this Kafka Spark Low Level consumer.
> >
> >
> > Regards,
> > Dibyendu
> >
> >
> > On Tue, Aug 5, 2014 at 5:32 AM, Jonathan Hodges <hodg...@gmail.com
> <mailto:
> > hodg...@gmail.com>> wrote:
> > Hi Yan,
> >
> > That is a good suggestion.  I believe non-Zookeeper offset management
> will
> > be a feature in the upcoming Kafka 0.8.2 release tentatively scheduled
> for
> > September.
> >
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/Inbuilt+Consumer+Offset+Management
> >
> > That should make this fairly easy to implement, but it will still require
> > explicit offset commits to avoid data loss which is different than the
> > current KafkaUtils implementation.
> >
> > Jonathan
> >
> >
> >
> >
> > On Mon, Aug 4, 2014 at 4:51 PM, Yan Fang <yanfang...@gmail.com<mailto:
> > yanfang...@gmail.com>> wrote:
> > Another suggestion that may help is that, you can consider use Kafka to
> > store the latest offset instead of Zookeeper. There are at least two
> > benefits: 1) lower the workload of ZK 2) support replay from certain
> > offset. This is how Samza<http://samza.incubator.apache.org/> deals with
> > the Kafka offset, the doc is here<
> >
> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/checkpointing.html
> >
> > . Thank you.
> >
> > Cheers,
> >
> > Fang, Yan
> > yanfang...@gmail.com<mailto:yanfang...@gmail.com>
> > +1 (206) 849-4108<tel:%2B1%20%28206%29%20849-4108>
> >
> > On Sun, Aug 3, 2014 at 8:59 PM, Patrick Wendell <pwend...@gmail.com
> > <mailto:pwend...@gmail.com>> wrote:
> > I'll let TD chime on on this one, but I'm guessing this would be a
> welcome
> > addition. It's great to see community effort on adding new
> > streams/receivers, adding a Java API for receivers was something we did
> > specifically to allow this :)
> >
> > - Patrick
> >
> > On Sat, Aug 2, 2014 at 10:09 AM, Dibyendu Bhattacharya <
> > dibyendu.bhattach...@gmail.com<mailto:dibyendu.bhattach...@gmail.com>>
> > wrote:
> > Hi,
> >
> > I have implemented a Low Level Kafka Consumer for Spark Streaming using
> > Kafka Simple Consumer API. This API will give better control over the
> Kafka
> > offset management and recovery from failures. As the present Spark
> > KafkaUtils uses HighLevel Kafka Consumer API, I wanted to have a better
> > control over the offset management which is not possible in Kafka
> HighLevel
> > consumer.
> >
> > This Project is available in below Repo :
> >
> > https://github.com/dibbhatt/kafka-spark-consumer
> >
> >
> > I have implemented a Custom Receiver consumer.kafka.client.KafkaReceiver.
> > The KafkaReceiver uses low level Kafka Consumer API (implemented in
> > consumer.kafka packages) to fetch messages from Kafka and 'store' it in
> > Spark.
> >
> > The logic will detect number of partitions for a topic and spawn that
> many
> > threads (Individual instances of Consumers). Kafka Consumer uses
> Zookeeper
> > for storing the latest offset for individual partitions, which will help
> to
> > recover in case of failure. The Kafka Consumer logic is tolerant to ZK
> > Failures, Kafka Leader of Partition changes, Kafka broker failures,
> >  recovery from offset errors and other fail-over aspects.
> >
> > The consumer.kafka.client.Consumer is the sample Consumer which uses this
> > Kafka Receivers to generate DStreams from Kafka and apply a Output
> > operation for every messages of the RDD.
> >
> > We are planning to use this Kafka Spark Consumer to perform Near Real
> Time
> > Indexing of Kafka Messages to target Search Cluster and also Near Real
> Time
> > Aggregation using target NoSQL storage.
> >
> > Kindly let me know your view. Also if this looks good, can I contribute
> to
> > Spark Streaming project.
> >
> > Regards,
> > Dibyendu
> >
> >
> >
> >
> >
>

Re: Low Level Kafka Consumer for Spark

Reply via email to