[
https://issues.apache.org/jira/browse/SPARK-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14056536#comment-14056536
]
Tathagata Das commented on SPARK-1975:
--------------------------------------
The ReceiverTracker stage actually starts the receivers as long-running tasks, and hence
runs permanently until the streaming context is stopped.
Since one core is allocated to each receiver, there have to be more cores than
receivers (i.e., than input DStreams). If you are creating one input DStream per
Kafka topic partition (i.e. 30) and <= 30 cores are allocated to the Spark
application, then this will occur. I suggest assigning multiple partitions to
each input DStream.
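The core-budget constraint described above can be sketched as follows. This is a minimal, Spark-free illustration; the helper name `hasEnoughCores` is hypothetical (not a Spark API), and the numbers come from the issue (30 partitions, one receiver each):

```scala
// Hedged sketch: each receiver-based input DStream permanently occupies one
// core, so batch processing can only run on whatever cores are left over.
def hasEnoughCores(totalCores: Int, numReceivers: Int): Boolean =
  totalCores > numReceivers // strictly more cores than receivers are needed

println(hasEnoughCores(30, 30)) // 30 cores, 30 receivers: batch jobs starve -> false
println(hasEnoughCores(31, 30)) // one spare core for processing -> true
```

The practical fix suggested in the comment is to create fewer input DStreams, each consuming several topic partitions, so that the receiver count stays well below the cores allocated to the application.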
> Spark streaming with kafka source stuck at runJob at ReceiverTracker.scala:275
> ------------------------------------------------------------------------------
>
> Key: SPARK-1975
> URL: https://issues.apache.org/jira/browse/SPARK-1975
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.0.0
> Reporter: Issac Buenrostro
>
> Spark streaming application running on YARN. We have a Kafka topic with 30
> partitions. We create 30 Kafka streams each consuming from a single partition.
> Looking at the spark stages, we see the following:
> collect at ReceiverTracker.scala:270 finished in 0.3s
> reduceByKey at ReceiverTracker.scala:270 finished in 3s
> runJob at ReceiverTracker.scala:275 has been running for 12+ minutes, no
> progress
> map at core.scala:224 (our processing class), has not started
> It seems to me that the ReceiverTracker is intended to run permanently in the
> background, but the scheduler is waiting for it to finish before scheduling
> other tasks?
--
This message was sent by Atlassian JIRA
(v6.2#6252)