Weizhong created SPARK-14527:
--------------------------------
Summary: Job can't finish when all NodeManagers are restarted while using
the external shuffle service
Key: SPARK-14527
URL: https://issues.apache.org/jira/browse/SPARK-14527
Project: Spark
Issue Type: Bug
Components: Shuffle, Spark Core, YARN
Reporter: Weizhong
Priority: Minor
1) Submit a wordcount app
2) Stop all NodeManagers while the 1st stage is running
3) After a few minutes, start all NodeManagers again
Now the job fails at the ResultStage, retries the ShuffleMapStage, then fails at
the ResultStage again; it keeps running in this loop and can never finish.
This is because when all NMs are stopped, the containers stay alive, but the
executor registration info stored on the NM (in YarnShuffleService) is lost. So
even after all the NMs recover, ResultStage tasks fail when fetching shuffle data.
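For context, a sketch of the configuration involved (exact behavior depends on the Spark and Hadoop versions in use): the external shuffle service runs inside the NM, so its registered-executor state only survives an NM restart if NM recovery is enabled, which lets the service persist and reload that state from the recovery directory.

```properties
# spark-defaults.conf: run shuffle serving in the NM-hosted external service
spark.shuffle.service.enabled=true

# yarn-site.xml equivalents (shown here in properties form for brevity):
# enable NM work-preserving restart so YarnShuffleService can persist
# its registered-executor info and reload it after the NM comes back
yarn.nodemanager.recovery.enabled=true
yarn.nodemanager.recovery.dir=/var/lib/hadoop-yarn/nm-recovery
yarn.nodemanager.aux-services=spark_shuffle
yarn.nodemanager.aux-services.spark_shuffle.class=org.apache.spark.network.yarn.YarnShuffleService
```

Without `yarn.nodemanager.recovery.enabled`, restarting every NM drops all executor registrations at once, which matches the "Executor is not registered" fetch failures below.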
{noformat}
16/04/06 17:02:14 WARN TaskSetManager: Lost task 2.0 in stage 1.11 (TID 220,
spark-1): FetchFailed(BlockManagerId(3, 192.168.42.175, 27337), shuffleId=0,
mapId=4, reduceId=2, message=
org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException:
Executor is not registered (appId=application_1459927459378_0005, execId=3)
...
16/04/06 17:02:14 INFO YarnScheduler: Removed TaskSet 1.11, whose tasks have
all completed, from pool
16/04/06 17:02:14 INFO DAGScheduler: Resubmitting ShuffleMapStage 0 (map at
wordcountWithSave.scala:21) and ResultStage 1 (saveAsTextFile at
wordcountWithSave.scala:32) due to fetch failure
{noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)