[
https://issues.apache.org/jira/browse/SPARK-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Weizhong updated SPARK-14527:
-----------------------------
Summary: Job can't finish when restart all nodemanages with using external
shuffle services (was: Job can't finish when restart all nodemanage when using
external shuffle services)
> Job can't finish when restart all nodemanages with using external shuffle
> services
> ----------------------------------------------------------------------------------
>
> Key: SPARK-14527
> URL: https://issues.apache.org/jira/browse/SPARK-14527
> Project: Spark
> Issue Type: Bug
> Components: Shuffle, Spark Core, YARN
> Reporter: Weizhong
> Priority: Minor
>
> 1) Submit a wordcount app
> 2) Stop all NodeManagers while the 1st stage is running
> 3) After some minutes, start all NodeManagers
> Now the job fails at the ResultStage, retries the ShuffleMapStage, and then
> fails at the ResultStage again. It keeps running in this loop and never
> finishes.
> This is because when all NMs are stopped, all the containers stay alive, but
> the executor info stored on the NMs (YarnShuffleService) is lost. So even
> after all the NMs recover, tasks in the ResultStage fail when they fetch
> shuffle data.
> {noformat}
> 16/04/06 17:02:14 WARN TaskSetManager: Lost task 2.0 in stage 1.11 (TID 220,
> spark-1): FetchFailed(BlockManagerId(3, 192.168.42.175, 27337), shuffleId=0,
> mapId=4, reduceId=2, message=
> org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException:
> Executor is not registered (appId=application_1459927459378_0005, execId=3)
> ...
> 16/04/06 17:02:14 INFO YarnScheduler: Removed TaskSet 1.11, whose tasks have
> all completed, from pool
> 16/04/06 17:02:14 INFO DAGScheduler: Resubmitting ShuffleMapStage 0 (map at
> wordcountWithSave.scala:21) and ResultStage 1 (saveAsTextFile at
> wordcountWithSave.scala:32) due to fetch failure
> {noformat}
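
The failure loop described above can be illustrated with a toy model. This is only a sketch, not Spark or YARN code: the names `ToyShuffleService` and `run_job` are hypothetical, and it assumes the shuffle service keeps executor registrations only in memory and that a live executor registers once at startup and never again.

```python
class ToyShuffleService:
    """Toy stand-in for the per-NodeManager shuffle service (hypothetical)."""

    def __init__(self):
        # (app_id, exec_id) -> registration info supplied by the executor
        self.registered = {}

    def register_executor(self, app_id, exec_id, info):
        self.registered[(app_id, exec_id)] = info

    def restart(self):
        # Assumption: registrations live only in memory, so a restart drops them.
        self.registered = {}

    def fetch_block(self, app_id, exec_id, block_id):
        if (app_id, exec_id) not in self.registered:
            raise RuntimeError(
                f"Executor is not registered (appId={app_id}, execId={exec_id})"
            )
        return f"data for {block_id}"


def run_job(service, app_id, exec_id, max_attempts=3):
    """Retry the map and result stages; return how many attempts failed."""
    failures = 0
    for _ in range(max_attempts):
        # The map stage is re-run by the executor that is still alive. It does
        # not register again, because it already did so once at startup.
        try:
            service.fetch_block(app_id, exec_id, "shuffle_0_4_2")
            return failures
        except RuntimeError:
            # Fetch failure: both stages get resubmitted and the loop repeats.
            failures += 1
    return failures
```

In this model a job run before `restart()` succeeds on the first attempt, while every attempt after `restart()` fails no matter how many retries are allowed, because nothing ever re-registers the executor. The real scheduler has no `max_attempts` cap in the scenario reported here, which is why the job never finishes.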
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)