[ https://issues.apache.org/jira/browse/SPARK-18142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723252#comment-15723252 ]
Shixiong Zhu commented on SPARK-18142:
--------------------------------------
Looks like we need a blacklist mechanism for workers.
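A minimal sketch of what such a blacklist could look like, written in Python for brevity rather than in the Master's own Scala. The class name, method names, and thresholds are all hypothetical and are not part of any Spark API; the idea is only that a worker with repeated recent launch failures gets skipped for a while.

```python
import time


class WorkerBlacklist:
    """Hypothetical helper: skip workers whose executor launches keep failing."""

    def __init__(self, max_failures=3, window_s=60.0, blacklist_s=600.0,
                 clock=time.monotonic):
        self._max_failures = max_failures  # failures tolerated inside the window
        self._window_s = window_s          # how far back failures are counted
        self._blacklist_s = blacklist_s    # how long a worker stays excluded
        self._clock = clock                # injectable so it can be tested
        self._failures = {}                # worker id -> recent failure times
        self._blacklisted_until = {}       # worker id -> expiry time

    def record_failure(self, worker_id):
        now = self._clock()
        # Keep only failures that are still inside the sliding window.
        recent = [t for t in self._failures.get(worker_id, [])
                  if now - t < self._window_s]
        recent.append(now)
        self._failures[worker_id] = recent
        if len(recent) >= self._max_failures:
            self._blacklisted_until[worker_id] = now + self._blacklist_s
            self._failures[worker_id] = []

    def is_blacklisted(self, worker_id):
        until = self._blacklisted_until.get(worker_id)
        if until is None:
            return False
        if self._clock() >= until:
            # The exclusion has expired; give the worker another chance.
            del self._blacklisted_until[worker_id]
            return False
        return True
```

A scheduler using something like this would call record_failure() whenever an executor on a worker goes to FAILED, and would check is_blacklisted() before choosing that worker for the next launch.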
> Spark Master tries to launch workers 145 times within 1 minute
> --------------------------------------------------------------
>
> Key: SPARK-18142
> URL: https://issues.apache.org/jira/browse/SPARK-18142
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.0.1
> Reporter: Burak Yavuz
>
> I observed a case where an instance running a worker was killed. The Spark
> Master kept trying to launch new executors on that instance even though it
> no longer existed. The launches failed 145 times within 1 minute, and the
> Master then killed the application.
> The instance takes ~10 minutes to be replaced. The Master should at least
> use exponential backoff when performing these retries so that the
> infrastructure has time to recover.
> {code}
> 16/10/27 17:31:18 INFO Master: Removing executor app-20161027124929-0000/3 because it is EXITED
> 16/10/27 17:31:18 INFO Master: Launching executor app-20161027124929-0000/4 on worker worker-20161027124917-10.0.43.232-60886
> 16/10/27 17:31:18 WARN Master: Got status update for unknown executor app-20161027124929-0000/3
> 16/10/27 17:31:18 INFO Master: Removing executor app-20161027124929-0000/4 because it is FAILED
> 16/10/27 17:31:18 INFO Master: Launching executor app-20161027124929-0000/5 on worker worker-20161027124917-10.0.43.232-60886
> 16/10/27 17:31:18 INFO Master: Removing executor app-20161027124929-0000/5 because it is FAILED
> ...
> 16/10/27 17:31:37 INFO Master: 10.0.70.32:32829 got disassociated, removing it.
> 16/10/27 17:31:37 INFO Master: 10.0.70.32:40523 got disassociated, removing it.
> 16/10/27 17:31:37 INFO Master: Removing worker worker-20161027124917-10.0.70.32-40523 on 10.0.70.32:40523
> 16/10/27 17:31:37 INFO Master: Telling app of lost executor: 147
> 16/10/27 17:32:30 INFO Master: Removing executor app-20161027124929-0000/0 because it is FAILED
> 16/10/27 17:32:30 ERROR Master: Application xxxx with ID app-20161027124929-0000 failed 145 times; removing it
> {code}
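The exponential backoff requested in the description could be sketched roughly as follows. This is an illustration in Python, not the Master's actual retry logic; the function name and the default values (1 second base delay, doubling, 600 second cap) are assumptions chosen only to match the ~10 minute replacement time mentioned in the report.

```python
def backoff_delays(base_s=1.0, factor=2.0, max_s=600.0, retries=10):
    """Yield the wait (in seconds) before each retry, growing geometrically.

    All defaults are hypothetical; max_s caps the wait so that a long outage
    does not push the retry interval up without bound.
    """
    delay = base_s
    for _ in range(retries):
        yield min(delay, max_s)
        delay *= factor


if __name__ == "__main__":
    delays = list(backoff_delays())
    print(delays)
    print(sum(delays))
```

With these defaults, ten retries are spread over 1023 seconds (about 17 minutes) instead of being used up within a single minute, so the retry budget would outlast a ~10 minute instance replacement.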