[
https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15849781#comment-15849781
]
Dibyendu Bhattacharya commented on SPARK-11045:
-----------------------------------------------
Released the latest version 1.0.10 of the Receiver-based Kafka Consumer for
Spark Streaming at Spark-Packages:
https://spark-packages.org/package/dibbhatt/kafka-spark-consumer
This release now includes:
- Tuning for better performance
- Support for consumer lag checking
- WAL-less recovery
- A better-tuned PID controller with automatic rate adjustment to incoming traffic
- Support for custom message interceptors
- Enhanced recovery from failure scenarios
Please refer to
https://github.com/dibbhatt/kafka-spark-consumer/blob/master/README.md for more
details.
> Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to
> Apache Spark Project
> ------------------------------------------------------------------------------------------------
>
> Key: SPARK-11045
> URL: https://issues.apache.org/jira/browse/SPARK-11045
> Project: Spark
> Issue Type: New Feature
> Components: DStreams
> Reporter: Dibyendu Bhattacharya
>
> This JIRA is to track the progress of making the Receiver based Low Level
> Kafka Consumer from spark-packages
> (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) to be
> contributed back to Apache Spark Project.
> This Kafka consumer has been around for more than a year and has matured over
> time. I see many adoptions of this package, and I receive positive feedback
> that this consumer gives better performance and fault-tolerance capabilities.
> The primary intent of this JIRA is to give the community a better alternative
> if they want to use the Receiver-based model.
> If this consumer makes it into Spark Core, it will definitely see more
> adoption and support from the community, and it will help the many who still
> prefer the Receiver-based model of Kafka consumer.
> I understand that the Direct Stream is the consumer that can give Exactly-Once
> semantics and uses the Kafka Low Level API, which is good. But the Direct
> Stream has concerns around recovering checkpoints after a driver code change.
> Application developers need to manage their own offsets, which is complex.
> Even if someone does manage their own offsets, it limits the parallelism Spark
> Streaming can achieve. If someone wants more parallelism and sets
> spark.streaming.concurrentJobs to more than 1, they can no longer rely on
> storing offsets externally, as they have no control over which batch will run
> in which sequence.
> Furthermore, the Direct Stream has higher latency, as it fetches messages
> from Kafka during the RDD action. Also, the number of RDD partitions is
> limited to the number of topic partitions, so unless your Kafka topic has
> enough partitions, you have limited parallelism while processing the RDD.
> Due to the above-mentioned concerns, many people who do not need Exactly-Once
> semantics still prefer the Receiver-based model. Unfortunately, when they fall
> back to the KafkaUtils.createStream approach, which uses the Kafka High Level
> Consumer, there are other issues around the reliability of the Kafka High
> Level API. The Kafka High Level API is buggy and has serious issues around
> consumer re-balance. Hence I do not think it is correct to advise people to
> use KafkaUtils.createStream in production.
> The better option at present is to use the consumer from spark-packages. It
> uses the Kafka Low Level Consumer API, stores offsets in ZooKeeper, and can
> recover from any failure. Below are a few highlights of this consumer:
> 1. It has an inbuilt PID controller for dynamic rate limiting.
> 2. In this consumer, rate limiting is done by controlling the size of the
> blocks, i.e. the number of bytes pulled from Kafka, whereas in Spark rate
> limiting is done by controlling the number of messages. The issue with
> throttling by number of messages is that if message sizes vary, the block
> size will also vary. Say your Kafka topic has messages of different sizes,
> from 10 KB to 500 KB: throttling by message count can never give a
> deterministic block size, hence there is no guarantee that memory
> back-pressure can really take effect.
> 3. This consumer uses the Kafka low level API, which gives better performance
> than the KafkaUtils.createStream based High Level API.
> 4. This consumer can give an end-to-end zero-data-loss channel if enabled with
> WAL.
> By accepting this low level Kafka consumer from spark-packages into the
> Apache Spark project, we will give the community better options for Kafka
> connectivity, for both the receiver-less and the Receiver-based model.
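The PID-driven, size-aware throttling described in points 1 and 2 above can be
sketched in a few lines. This is an illustrative toy only, not the package's
actual API: the class names, gains, and byte budgets below are all hypothetical,
and the real controller lives inside kafka-spark-consumer.

```python
class PIDRateController:
    """Toy PID controller that adjusts a per-batch byte budget.

    Illustrative sketch only; gains (kp, ki, kd) are made-up defaults, not
    the tuned values used by the actual consumer.
    """

    def __init__(self, kp=0.5, ki=0.1, kd=0.0, floor=1024):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.floor = floor          # never starve the stream completely
        self.integral = 0.0
        self.prev_error = 0.0

    def next_budget(self, current_budget, processed_bytes):
        # Positive error: we fetched more bytes than we managed to process,
        # so the budget should shrink for the next batch.
        error = current_budget - processed_bytes
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        correction = (self.kp * error
                      + self.ki * self.integral
                      + self.kd * derivative)
        return max(self.floor, current_budget - correction)


def fetch_by_bytes(messages, byte_budget):
    """Take messages until the byte budget is exhausted (size-based throttle).

    Unlike a count-based throttle, the resulting block size stays bounded
    no matter how message sizes vary (e.g. 10 KB vs 500 KB).
    """
    batch, used = [], 0
    for msg in messages:
        if used + len(msg) > byte_budget:
            break
        batch.append(msg)
        used += len(msg)
    return batch, used
```

A count-based throttle of "2 messages per batch" would admit anywhere from
20 KB to 1000 KB here; the byte budget keeps every block under a fixed bound,
which is what lets the memory back-pressure actually take effect.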
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)