[ 
https://issues.apache.org/jira/browse/SPARK-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weizhong resolved SPARK-14527.
------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.1

> Job can't finish when restarting all NodeManagers while using the external 
> shuffle service
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-14527
>                 URL: https://issues.apache.org/jira/browse/SPARK-14527
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core, YARN
>            Reporter: Weizhong
>            Priority: Minor
>             Fix For: 1.6.1
>
>
> 1) Submit a wordcount app (a minimal sketch of such an app is included below)
> 2) Stop all NodeManagers while the ShuffleMapStage is running
> 3) After some minutes, start all NodeManagers
> Now the job fails at the ResultStage, then retries the ShuffleMapStage, and 
> then the ResultStage fails again; it keeps running in this loop and can 
> never finish.
> This is because when all the NMs are stopped, the containers are still 
> alive, but the executor info stored on the NM (YarnShuffleService) is lost. 
> So even after all the NMs recover, tasks in the ResultStage fail when 
> fetching shuffle data:
> {noformat}
> 16/04/06 17:02:14 WARN TaskSetManager: Lost task 2.0 in stage 1.11 (TID 220, spark-1): FetchFailed(BlockManagerId(3, 192.168.42.175, 27337), shuffleId=0, mapId=4, reduceId=2, message=
> org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException: Executor is not registered (appId=application_1459927459378_0005, execId=3)
> ...
> 16/04/06 17:02:14 INFO YarnScheduler: Removed TaskSet 1.11, whose tasks have all completed, from pool
> 16/04/06 17:02:14 INFO DAGScheduler: Resubmitting ShuffleMapStage 0 (map at wordcountWithSave.scala:21) and ResultStage 1 (saveAsTextFile at wordcountWithSave.scala:32) due to fetch failure
> {noformat}
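> To make the failure mode concrete: the NM-side shuffle service keeps 
> executor registrations in process memory, keyed by (appId, execId). Below is 
> a simplified, illustrative sketch of that lookup; the real code lives in 
> ExternalShuffleBlockResolver in the network-shuffle module, and the names 
> here are not the actual API:
> {code:scala}
> import scala.collection.concurrent.TrieMap
>
> case class AppExecId(appId: String, execId: String)
>
> class ShuffleServiceSketch {
>   // Held only in the NM process's memory: restarting all NMs empties this
>   // map, while the executors' shuffle files remain on disk.
>   private val executors =
>     TrieMap.empty[AppExecId, String /* executor's local dirs */]
>
>   def registerExecutor(appId: String, execId: String, localDirs: String): Unit =
>     executors.put(AppExecId(appId, execId), localDirs)
>
>   def getBlockData(appId: String, execId: String,
>                    shuffleId: Int, mapId: Int, reduceId: Int): String = {
>     val dirs = executors.getOrElse(AppExecId(appId, execId),
>       // After the NM restart this branch is always taken, producing the
>       // RuntimeException seen in the log above.
>       throw new RuntimeException(
>         s"Executor is not registered (appId=$appId, execId=$execId)"))
>     s"shuffle_${shuffleId}_${mapId}_${reduceId} read from $dirs"
>   }
> }
> {code}
> Because the lost registrations are not restored after the NMs come back, 
> every retry of the ResultStage hits the same exception, which matches the 
> loop described above.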


