[
https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wenchen Fan updated SPARK-21219:
--------------------------------
Fix Version/s: 2.2.1
> Task retry occurs on same executor due to race condition with blacklisting
> --------------------------------------------------------------------------
>
> Key: SPARK-21219
> URL: https://issues.apache.org/jira/browse/SPARK-21219
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 2.1.1
> Reporter: Eric Vandenberg
> Assignee: Eric Vandenberg
> Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
> Attachments: spark_driver.log.anon, spark_executor.log.anon
>
>
> When a task fails, (1) it is added to the pending task list and then (2) the
> corresponding blacklist policy is enforced (i.e., specifying whether it can
> or cannot run on a particular node, executor, etc.). Unfortunately, with this
> ordering the retry can be assigned to the same executor, which may already
> be shutting down and so fails the retry immediately. Instead, the order
> should be (1) the blacklist state is updated and then (2) the task is
> assigned, ensuring that the blacklist policy is properly enforced.
> The attached logs demonstrate the race condition.
> See spark_executor.log.anon:
> 1. Task 55.2 fails on the executor
> 17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 39575)
> java.lang.OutOfMemoryError: Java heap space
> 2. Immediately the same executor is assigned the retry task:
> 17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651
> 17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651)
> 3. The retry task of course fails since the executor is also shutting down
> due to the original task 55.2 OOM failure.
> See the spark_driver.log.anon:
> The driver processes the lost task 55.2:
> 17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID
> 39575, foobar####.masked-server.com, executor
> attempt_foobar####.masked-server.com-####_####_####_####.masked-server.com-####_####_####_####_0):
> java.lang.OutOfMemoryError: Java heap space
> The driver then receives the ExecutorLostFailure for the retry task 55.3
> (although it is obfuscated in these logs, the server info is the same):
> 17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID
> 39651, foobar####.masked-server.com, executor
> attempt_foobar####.masked-server.com-####_####_####_####.masked-server.com-####_####_####_####_0):
> ExecutorLostFailure (executor
> attempt_foobar####.masked-server.com-####_####_####_####.masked-server.com-####_####_####_####_0
> exited caused by one of the running tasks) Reason: Remote RPC client
> disassociated. Likely due to containers exceeding thresholds, or network
> issues. Check driver logs for WARN messages.
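
The ordering problem described in the issue can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyTaskSet`, its methods, and the executor names `exec-1` and `exec-2` are hypothetical and are not Spark's scheduler code. The `offer_in_between` parameter stands in for a resource offer that happens to arrive between the two steps.

```python
class ToyTaskSet:
    """Toy model of a task set. Hypothetical; not Spark's TaskSetManager."""

    def __init__(self):
        self.pending = []      # task indexes waiting to be scheduled
        self.blacklist = {}    # task index -> executors the task must avoid
        self.assignments = []  # (task index, executor) pairs, in order

    def offer(self, executor):
        """Assign the first pending task that is allowed on this executor."""
        for task in list(self.pending):
            if executor not in self.blacklist.get(task, set()):
                self.pending.remove(task)
                self.assignments.append((task, executor))
                return task
        return None

    def fail_enqueue_first(self, task, executor, offer_in_between=None):
        """Problematic order: (1) enqueue the retry, (2) update the blacklist."""
        self.pending.append(task)
        if offer_in_between is not None:
            self.offer(offer_in_between)  # offer arrives before the blacklist update
        self.blacklist.setdefault(task, set()).add(executor)

    def fail_blacklist_first(self, task, executor, offer_in_between=None):
        """Proposed order: (1) update the blacklist, (2) enqueue the retry."""
        self.blacklist.setdefault(task, set()).add(executor)
        if offer_in_between is not None:
            self.offer(offer_in_between)  # nothing is pending yet, so nothing is assigned
        self.pending.append(task)


# Task 55 fails on exec-1, and exec-1 makes an offer in the window between the steps.
buggy = ToyTaskSet()
buggy.fail_enqueue_first(55, "exec-1", offer_in_between="exec-1")

fixed = ToyTaskSet()
fixed.fail_blacklist_first(55, "exec-1", offer_in_between="exec-1")
fixed.offer("exec-1")  # still refused: exec-1 is blacklisted for task 55
fixed.offer("exec-2")  # accepted on a different executor

print(buggy.assignments)
print(fixed.assignments)
```

In this model the enqueue-first order lets the retry land on the executor that just failed, while the blacklist-first order leaves it pending until a different executor makes an offer.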
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)