[ https://issues.apache.org/jira/browse/SPARK-15408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293437#comment-15293437 ]
Cody Koeninger commented on SPARK-15408:
----------------------------------------
By design, the Kafka direct stream is not going to try to hide Kafka failures
from you.
The number of times the driver will retry is controlled by
spark.streaming.kafka.maxRetries
(http://spark.apache.org/docs/latest/configuration.html#spark-streaming).
The number of failures tolerated for any individual task is controlled the same
way as in any other Spark job, via spark.task.maxFailures.
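For reference, a minimal sketch of how both settings could be raised when
building the SparkConf; the application name and the specific values are
illustrative assumptions, not recommendations:

    import org.apache.spark.SparkConf

    // Illustrative values; the defaults are 1 and 4 respectively.
    val conf = new SparkConf()
      .setAppName("kafka-direct-example") // hypothetical app name
      // Driver-side retries when the direct stream looks up leaders/offsets.
      .set("spark.streaming.kafka.maxRetries", "3")
      // Standard per-task failure tolerance, same as for any other Spark job.
      .set("spark.task.maxFailures", "8")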
> Spark streaming app crashes with NotLeaderForPartitionException
> ----------------------------------------------------------------
>
> Key: SPARK-15408
> URL: https://issues.apache.org/jira/browse/SPARK-15408
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.6.0
> Environment: Ubuntu 64 bit
> Reporter: Johny Mathew
> Priority: Critical
>
> We have a Spark Streaming application reading from Kafka (with the Kafka Direct
> API), and it crashed with the exception shown below. We have a 5-node Kafka
> cluster with 19 partitions (replication factor 3). Even though the Spark
> application crashed, the other Kafka consumer apps kept running fine. Only one
> of the 5 Kafka nodes was misbehaving (it did not go down).
> /opt/hadoop/bin/yarn application -status application_1463151451543_0007
> 16/05/13 20:09:56 INFO client.RMProxy: Connecting to ResourceManager at
> /172.16.130.189:8050
> Application Report :
> Application-Id : application_1463151451543_0007
> Application-Name : com.ibm.alchemy.eventgen.EventGenMetrics
> Application-Type : SPARK
> User : stack
> Queue : default
> Start-Time : 1463155034571
> Finish-Time : 1463155310520
> Progress : 100%
> State : FINISHED
> Final-State : FAILED
> Tracking-URL : N/A
> RPC Port : 0
> AM Host : 172.16.130.188
> Aggregate Resource Allocation : 9562329 MB-seconds, 2393 vcore-seconds
> Diagnostics : User class threw exception:
> org.apache.spark.SparkException:
> ArrayBuffer(kafka.common.NotLeaderForPartitionException,
> kafka.common.NotLeaderForPartitionException,
> kafka.common.NotLeaderForPartitionException,
> kafka.common.NotLeaderForPartitionException,
> kafka.common.NotLeaderForPartitionException,
> kafka.common.NotLeaderForPartitionException,
> kafka.common.NotLeaderForPartitionException,
> kafka.common.NotLeaderForPartitionException, org.apache.spark.SparkException:
> Couldn't find leader offsets for Set([alchemy-metrics,17],
> [alchemy-metrics,10], [alchemy-metrics,3], [alchemy-metrics,4],
> [alchemy-metrics,9], [alchemy-metrics,15], [alchemy-metrics,18],
> [alchemy-metrics,5]))
> We cleared the checkpoint and restarted the application, but it crashed again.
> In the end we identified the misbehaving Kafka node and restarted it, which
> fixed the problem.
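For context, a minimal sketch of the kind of direct stream setup described in
the report above, against the Spark 1.6 spark-streaming-kafka API; the broker
list, batch interval, and application name are assumptions, only the topic name
comes from the error message:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("alchemy-metrics-consumer") // hypothetical name
    val ssc = new StreamingContext(conf, Seconds(30)) // batch interval is an assumption

    // Placeholder broker addresses; the direct stream talks to the brokers directly.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

    // Leader/offset lookups happen when the stream is created and on each batch;
    // failures such as NotLeaderForPartitionException surface here rather than
    // being hidden, as noted in the comment above.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("alchemy-metrics"))

    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()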