Have you tried adjusting the timeout? A rough sketch of what that could look like is below the quoted message.

On Mon, Oct 16, 2017 at 8:08 AM, Suprith T Jain <t.supr...@gmail.com> wrote:

> Hi guys,
>
> I have a 3-node cluster and I am running a Spark Streaming job. Consider the
> example below:
>
> spark-submit --master yarn-cluster --class
> com.huawei.bigdata.spark.examples.FemaleInfoCollectionPrint --jars
> /opt/client/Spark/spark/lib/streamingClient/kafka-clients-0.8.2.1.jar,/opt/client/Spark/spark/lib/streamingClient/kafka_2.10-0.8.2.1.jar,/opt/client/Spark/spark/lib/streamingClient/spark-streaming-kafka_2.10-1.5.1.jar
> /opt/SparkStreamingExample-1.0.jar /tmp/test 10 test
> 189.132.190.106:21005,189.132.190.145:21005,10.1.1.1:21005
>
> In this case, suppose node 10.1.1.1 is down. Then for every window batch,
> Spark tries to send a request to all the nodes.
> This code is in the class org.apache.spark.streaming.kafka.KafkaCluster.
>
> Function: getPartitionMetadata()
> Line: val resp: TopicMetadataResponse = consumer.send(req)
>
> The function getPartitionMetadata() is called from getPartitions() and
> findLeaders(), which get called for every batch.
>
> Hence, if the node is down, the connection fails and it waits for the
> timeout before continuing, which adds to the processing time.
>
> Question:
> Is there any way to avoid this?
> In simple words, I do not want Spark to send a request to the node that is
> down for every batch. How can I achieve this?
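For illustration only, here is a minimal sketch of passing a lower socket.timeout.ms through kafkaParams to the 0.8 direct stream API, based on the broker list and topic from your command line. The timeout value, application name, and batch interval below are placeholders, not recommendations; tune them to your setup.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object DirectStreamTimeoutSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("DirectStreamTimeoutSketch"), Seconds(10))

        // kafkaParams are forwarded to KafkaCluster's SimpleConsumer config.
        // Leaving the dead broker out of metadata.broker.list avoids contacting
        // it at all; a smaller socket.timeout.ms bounds how long each failed
        // connection attempt blocks (3000 is illustrative; the default is 30000).
        val kafkaParams = Map(
          "metadata.broker.list" -> "189.132.190.106:21005,189.132.190.145:21005",
          "socket.timeout.ms"    -> "3000"
        )

        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("test"))

        stream.map(_._2).print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

A shorter timeout only shrinks the wait per batch. If the goal is to not contact the unreachable host at all, the simplest option is probably to drop it from metadata.broker.list until it is back up.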