[ https://issues.apache.org/jira/browse/SPARK-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045564#comment-14045564 ]

Patrick Wendell commented on SPARK-2228:
----------------------------------------

I ran your reproduction locally. What I found was that it simply generates 
events faster than the listener can process them, and that backlog triggers 
all of the subsequent errors:

{code}
$ cat job-log.txt | grep ERROR | head -n 10
14/06/26 22:41:02 ERROR scheduler.LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
14/06/26 22:42:01 ERROR scheduler.LiveListenerBus: Listener JobProgressListener threw an exception
14/06/26 22:42:01 ERROR scheduler.LiveListenerBus: Listener JobProgressListener threw an exception
14/06/26 22:42:01 ERROR scheduler.LiveListenerBus: Listener JobProgressListener threw an exception
14/06/26 22:42:01 ERROR scheduler.LiveListenerBus: Listener JobProgressListener threw an exception
14/06/26 22:42:01 ERROR scheduler.LiveListenerBus: Listener JobProgressListener threw an exception
14/06/26 22:42:01 ERROR scheduler.LiveListenerBus: Listener JobProgressListener threw an exception
{code}

If someone submits a job that creates thousands of stages within a few seconds, 
this can happen. But I haven't seen it in a real production job that does 
actual nontrivial work inside each stage.

We could consider an alternative design that applies back pressure instead of 
dropping events.
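
The drop-versus-block tradeoff can be sketched with a plain `java.util.concurrent` bounded queue. This is a toy model, not the actual LiveListenerBus code: the capacity of 4, the event strings, and the slow consumer are all made up for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ListenerBusSketch {
    // Hypothetical bounded event queue, standing in for the listener bus's
    // fixed-capacity queue. Capacity is deliberately tiny for the demo.
    static final int CAPACITY = 4;

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(CAPACITY);

        // Current behavior (sketch): offer() returns false when the queue
        // is full, and the event is dropped -- the situation behind the
        // "Dropping SparkListenerEvent" error above.
        int dropped = 0;
        for (int i = 0; i < 10; i++) {
            if (!queue.offer("event-" + i)) {
                dropped++; // event lost; listeners never see it
            }
        }
        System.out.println("dropped=" + dropped); // 6 of 10 events lost

        // Back-pressure alternative (sketch): put() blocks the producer
        // until the consumer drains the queue, so nothing is lost -- at the
        // cost of slowing down the thread that posts events.
        queue.clear();
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    queue.take();    // slow listener draining events
                    Thread.sleep(1);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        for (int i = 0; i < 10; i++) {
            queue.put("event-" + i); // blocks instead of dropping
        }
        consumer.join();
        System.out.println("remaining=" + queue.size()); // 0: nothing dropped
    }
}
```

The catch with the blocking variant is that the producer here is the scheduler itself, so a slow listener would throttle task scheduling rather than just losing UI updates.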

> onStageSubmitted is not properly called, so NoSuchElement will be thrown in 
> onStageCompleted
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-2228
>                 URL: https://issues.apache.org/jira/browse/SPARK-2228
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.0
>            Reporter: Baoxu Shi
>
> We are using `saveAsObjectFile` and `objectFile` to cut off the lineage 
> during iterative computation, but after several hundred iterations a 
> `NoSuchElementException` is thrown. We checked the code and located the 
> problem in `org.apache.spark.ui.jobs.JobProgressListener`: when 
> `onStageCompleted` is called, the `stageId` cannot be found in 
> `stageIdToPool`, although it does exist in the other HashMaps. So we think 
> `onStageSubmitted` was not called properly: `Spark` added the stage but 
> failed to send the message to the listeners, and the error occurs when the 
> `finish` message is sent to them. 
> This problem causes a huge number of `active stages` to show up in the 
> `SparkUI`, which is really annoying, but it may not affect the final 
> result, according to my test code.
> I'm willing to help solve this problem; any idea which part I should 
> change? I assume `org.apache.spark.scheduler.SparkListenerBus` has 
> something to do with it, but it looks fine to me.
> FYI, here is the test code that reproduces the problem. I don't know how to 
> post highlighted code here, so I put it on gist to keep the issue clean.
> https://gist.github.com/bxshi/b5c0fe0ae089c75a39bd



--
This message was sent by Atlassian JIRA
(v6.2#6252)
