GitHub user pnakhe opened a pull request:

    https://github.com/apache/spark/pull/16278

    [SPARK-18779][STREAMING][KAFKA] Messages being received only from one 
partition when using Spark Streaming integration for Kafka 0.10 with kafka 
client library at 0.10.1

    ## What changes were proposed in this pull request?
    
    This pull request is to fix for SPARK-18779. When using kafka 0.10.1.0 
messages are being read only from one partition. The current kafka-spark 0.10 
integration ships with kafka 0.10.0.1 where messages are read from all 
partitions but using kafka client 0.10.1.0 client, messages are read from only 
one partition.
    
    In the ConsumerStrategy class there is a pause on the consumer. We never 
resume the consumer and that seems to causing the issue. The KafkaConsumer 
implementation has changed between 10.0.1 and 10.1.0 which has exposed this 
issue. The solution to this issue is to resume the consumer before we find the 
position in DirectKafkaInputDStream class in the latestOffsets method.The 
reason the issue is not seen in the current setup is because pause/resume logic 
is changed in the latest kafka version. We dont seem to have a resume for the 
pause and hence this fix is necessary.
    
    This patch fixes the issue.
    
    ## How was this patch tested?
    The spark-kafka test cases were run to check no regressions were caused. I 
have checked that messages are being read from all partitions for both 0.10.0.1 
kafka client and 0.10.1.0 client.
    
    
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/pnakhe/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16278.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16278
    
----
commit a13036ae6c00b3c6ca128f5493dc30eeac3f8e0f
Author: pnakhe <[email protected]>
Date:   2016-12-14T10:59:18Z

    Fix for SPARK-18779
    
    In the ConsumerStrategy class there is a pause on the consumer. We never 
resume the consumer and that seems to causing the issue. The KafkaConsumer 
implementation has changed between 10.0.1 and 10.1.0 which has exposed this 
issue. The solution to this issue is to resume the consumer before we find the 
position in DirectKafkaInputDStream class in the latestOffsets method.
    I have tested this fix and it works fine. The reason the issue is not seen 
in the current setup is because pause/resume logic is changed in the latest 
kafka version. We dont seem to have a resume for the pause and hence this fix 
is necessary.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to