[ 
https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Vandenberg updated SPARK-21219:
------------------------------------
    Description: 
When a task fails, it is (1) added to the pending task list and then (2) the 
corresponding blacklist policy is enforced (i.e., recording whether it can or 
cannot run on a particular node, executor, etc.). Unfortunately, with this 
ordering, retrying the task could assign it to the same executor, which, 
incidentally, could be shutting down and immediately fail the retry. Instead, 
the order should be: (1) update the blacklist state, then (2) assign the task, 
ensuring that the blacklist policy is properly enforced.
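The ordering issue can be sketched with a toy scheduler model; this is an 
illustrative Python sketch with hypothetical names, not Spark's actual 
TaskSetManager code:

```python
# Toy model of the race: names and structure are illustrative only,
# not Spark's actual TaskSetManager internals.

class Scheduler:
    def __init__(self):
        self.blacklisted = set()                 # executors to avoid
        self.pending = []                        # tasks awaiting retry
        self.executors = ["exec-1", "exec-2"]

    def _assign(self, task):
        # Pick the first executor the blacklist allows.
        for e in self.executors:
            if e not in self.blacklisted:
                return e
        return None

    def handle_failure_buggy(self, task, failed_executor):
        # Buggy order: the retry is queued and assigned *before* the
        # blacklist is updated, so it can land on the failing executor.
        self.pending.append(task)
        assigned = self._assign(task)
        self.blacklisted.add(failed_executor)
        return assigned

    def handle_failure_fixed(self, task, failed_executor):
        # Fixed order: update the blacklist first, then queue and assign,
        # so the retry cannot be handed back to the failing executor.
        self.blacklisted.add(failed_executor)
        self.pending.append(task)
        return self._assign(task)
```

With the buggy ordering the retry of a task that failed on exec-1 is assigned 
back to exec-1; with the fixed ordering it goes to exec-2.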

The attached logs demonstrate the race condition.

See spark_executor.log.anon:

1. Task 55.2 fails on the executor

17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 
39575)
java.lang.OutOfMemoryError: Java heap space

2. Immediately the same executor is assigned the retry task:

17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651
17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651)

3. The retry task, of course, fails, since the executor is also shutting down 
due to the original task 55.2 OOM failure.

See the spark_driver.log.anon:

The driver processes the lost task 55.2:

17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 39575, 
foobar####.masked-server.com, executor 
attempt_foobar####.masked-server.com-####_####_####_####.masked-server.com-####_####_####_####_0):
 java.lang.OutOfMemoryError: Java heap space

The driver then receives the ExecutorLostFailure for the retry task 55.3 
(although it's obfuscated in these logs, the server info is the same):

17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 39651, 
foobar####.masked-server.com, executor 
attempt_foobar####.masked-server.com-####_####_####_####.masked-server.com-####_####_####_####_0):
 ExecutorLostFailure (executor 
attempt_foobar####.masked-server.com-####_####_####_####.masked-server.com-####_####_####_####_0
 exited caused by one of the running tasks) Reason: Remote RPC client 
disassociated. Likely due to containers exceeding thresholds, or network 
issues. Check driver logs for WARN messages.


  was:
When a task fails it is added into the pending task list and corresponding 
black list policy is enforced (ie, specifying if it can/can't run on a 
particular node/executor/etc.)  Unfortunately the ordering is such that 
retrying the task could assign the task to the same executor, which, 
incidentally could be shutting down and immediately fail the retry.   Instead 
the black list state should be updated and then the task assigned, ensuring 
that the black list policy is properly enforced.




> Task retry occurs on same executor due to race condition with blacklisting
> --------------------------------------------------------------------------
>
>                 Key: SPARK-21219
>                 URL: https://issues.apache.org/jira/browse/SPARK-21219
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.1.1
>            Reporter: Eric Vandenberg
>            Priority: Minor
>         Attachments: spark_driver.log.anon, spark_executor.log.anon
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
