GitHub user pnakhe opened a pull request:
https://github.com/apache/spark/pull/16278
[SPARK-18779][STREAMING][KAFKA] Messages being received only from one
partition when using Spark Streaming integration for Kafka 0.10 with kafka
client library at 0.10.1
## What changes were proposed in this pull request?
This pull request fixes SPARK-18779. When using the kafka 0.10.1.0 client,
messages are read from only one partition. The current spark-kafka 0.10
integration ships with kafka client 0.10.0.1, where messages are read from all
partitions, but with the kafka 0.10.1.0 client messages are read from only
one partition.
In the ConsumerStrategy class there is a pause on the consumer. We never
resume the consumer, and that seems to be causing the issue. The KafkaConsumer
implementation changed between 0.10.0.1 and 0.10.1.0, which exposed this
issue. The solution is to resume the consumer before we find the position in
the latestOffsets method of the DirectKafkaInputDStream class. The reason the
issue is not seen in the current setup is that the pause/resume logic changed
in the latest kafka version. We don't seem to have a resume matching the
pause, and hence this fix is necessary.
This patch fixes the issue.
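The pause-without-resume failure mode described above can be sketched with a toy model. This is not the real KafkaConsumer API or the actual Spark code, only an illustrative simulation of the semantics where poll() skips paused partitions, so a pause() that is never matched by a resume() silently starves fetches:

```python
# Toy model of pause/resume semantics (illustrative only, not kafka-clients):
# poll() returns records only from partitions that are not paused, so an
# unmatched pause() means those partitions never deliver messages again.

class ToyConsumer:
    def __init__(self, partitions):
        self.assignment = set(partitions)
        self.paused = set()
        # Pretend each partition has three buffered messages.
        self.log = {p: [f"{p}-msg{i}" for i in range(3)] for p in partitions}

    def pause(self, partitions):
        self.paused |= set(partitions) & self.assignment

    def resume(self, partitions):
        self.paused -= set(partitions)

    def poll(self):
        # Only non-paused partitions contribute records to the poll result.
        return {p: self.log[p] for p in self.assignment - self.paused}

consumer = ToyConsumer(["topic-0", "topic-1", "topic-2"])
consumer.pause(consumer.assignment)   # paused during setup, never resumed
stuck = consumer.poll()               # no partition delivers anything
consumer.resume(consumer.assignment)  # the proposed fix: resume before reading positions
healthy = consumer.poll()             # all partitions deliver again
print(sorted(stuck), sorted(healthy))
```

Under these toy semantics, `stuck` is empty while `healthy` carries records from every assigned partition, which is the behavior the patch aims to restore by resuming before the position lookup.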
## How was this patch tested?
The spark-kafka test cases were run to check that no regressions were
introduced. I have verified that messages are read from all partitions with
both the 0.10.0.1 and 0.10.1.0 kafka clients.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/pnakhe/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16278.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16278
----
commit a13036ae6c00b3c6ca128f5493dc30eeac3f8e0f
Author: pnakhe <[email protected]>
Date: 2016-12-14T10:59:18Z
Fix for SPARK-18779
In the ConsumerStrategy class there is a pause on the consumer. We never
resume the consumer, and that seems to be causing the issue. The KafkaConsumer
implementation changed between 0.10.0.1 and 0.10.1.0, which exposed this
issue. The solution is to resume the consumer before we find the position in
the latestOffsets method of the DirectKafkaInputDStream class.
I have tested this fix and it works fine. The reason the issue is not seen
in the current setup is that the pause/resume logic changed in the latest
kafka version. We don't seem to have a resume matching the pause, and hence
this fix is necessary.
----