Julie Zhang created SPARK-21876:
-----------------------------------
Summary: Idling Executors that never handled any tasks are not
cleared from BlockManager after being removed
Key: SPARK-21876
URL: https://issues.apache.org/jira/browse/SPARK-21876
Project: Spark
Issue Type: Bug
Components: Scheduler, Spark Core
Affects Versions: 2.2.0, 1.6.3
Reporter: Julie Zhang
This happens when 'spark.dynamicAllocation.enabled' is set to 'true'. We use
YARN as our resource manager.
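For context, a minimal configuration that puts an application into this
situation might look like the sketch below. The property names are real Spark
settings; the explicit 60s value just spells out the documented default for
the idle timeout.

{code:scala}
import org.apache.spark.SparkConf

// Illustrative configuration for dynamic allocation on YARN.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  // The external shuffle service is required for dynamic allocation.
  .set("spark.shuffle.service.enabled", "true")
  // 60s is the documented default, shown explicitly for clarity.
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
{code}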
1) Executor A is launched, but no task is ever submitted to it;
2) After 'spark.dynamicAllocation.executorIdleTimeout' seconds, executor A is
removed (ExecutorAllocationManager.scala, schedule(): 294); a simplified
sketch of this path appears after these steps;
3) The scheduler is notified that executor A has been lost (in our case, in
YarnSchedulerBackend.scala: 209).
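In rough form, the idle-timeout path in step 2 looks like the sketch below;
this is a simplified paraphrase with stand-in types (IdleExecutorReaper and
ExecutorClient are invented for illustration, not actual Spark classes).

{code:scala}
import scala.collection.mutable

// Stand-in for the cluster-manager client; invented for this sketch.
trait ExecutorClient {
  def killExecutor(executorId: String): Boolean
}

// Simplified paraphrase of ExecutorAllocationManager's idle check in
// schedule(); names and structure are approximate, not the actual source.
class IdleExecutorReaper(client: ExecutorClient) {
  // executorId -> wall-clock time (ms) at which it expires as idle
  val removeTimes = mutable.Map.empty[String, Long]

  def schedule(now: Long): Unit = {
    // retain is named filterInPlace in Scala 2.13
    removeTimes.retain { case (executorId, expireTime) =>
      val expired = now >= expireTime
      if (expired) {
        // Ask the cluster manager to kill the idle executor. The executor
        // may never have run a single task, which is exactly the case that
        // triggers this bug.
        client.killExecutor(executorId)
      }
      !expired
    }
  }
}
{code}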
In the TaskSchedulerImpl.scala method executorLost(executorId: String, reason:
ExecutorLossReason), the None case (TaskSchedulerImpl.scala: 548) assumes the
executor has already been removed, but that assumption does not hold for an
executor that never ran a task. As a result, the DAGScheduler and
BlockManagerMaster are never notified about the loss of executor A.
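Paraphrased, the problematic shape of that method is shown below. The types
here are stand-ins invented so the sketch compiles on its own; the real method
lives in TaskSchedulerImpl and is more involved.

{code:scala}
import scala.collection.mutable

// Stand-ins invented for this sketch; the real Spark types are more involved.
case class ExecutorLossReason(message: String)
trait DagScheduler {
  def executorLost(executorId: String, reason: ExecutorLossReason): Unit
}
trait BlockManagerMaster {
  def removeExecutor(executorId: String): Unit
}

// Simplified paraphrase of TaskSchedulerImpl.executorLost.
class Scheduler(dagScheduler: DagScheduler, blockManagerMaster: BlockManagerMaster) {
  // In the reported scenario executor A never appears in this map, because
  // it never processed a task.
  val executorIdToHost = mutable.Map.empty[String, String]

  def executorLost(executorId: String, reason: ExecutorLossReason): Unit = {
    executorIdToHost.get(executorId) match {
      case Some(_) =>
        // Known executor: clean up scheduler state and propagate the loss
        // to the DAGScheduler and the BlockManagerMaster.
        executorIdToHost -= executorId
        blockManagerMaster.removeExecutor(executorId)
        dagScheduler.executorLost(executorId, reason)
      case None =>
        // Assumed to be a duplicate notification for an already-removed
        // executor. For executor A the assumption is wrong: the loss is
        // silently dropped, and the BlockManagerMaster keeps its stale
        // registration for A.
        println(s"Lost executor $executorId (already removed): $reason")
    }
  }
}
{code}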
When GC eventually happens, the ContextCleaner tries to clean up unreferenced
objects. Because executor A was never removed from the blockManagerIdByExecutor
map, BlockManagerMasterEndpoint sends clean-up requests for the non-existent
executor, producing many error messages like this in the driver log:
ERROR [2017-08-08 00:00:23,596]
org.apache.spark.network.client.TransportClient: Failed to send RPC xxx to
xxx/xxx:x: java.nio.channels.ClosedChannelException
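To make the failure chain concrete: when a GC'd broadcast (or RDD) is cleaned,
BlockManagerMasterEndpoint fans the removal request out to every block manager
it still knows about, including the stale entry for executor A. The sketch
below is a simplified paraphrase with invented stand-in types, not the actual
Spark source.

{code:scala}
import scala.collection.mutable

// Stand-ins invented for this sketch.
case class RemoveBroadcast(broadcastId: Long)
trait SlaveEndpoint {
  // May fail if the remote channel is closed.
  def ask(msg: Any): Unit
}
case class BlockManagerInfo(executorId: String, slaveEndpoint: SlaveEndpoint)

// Simplified paraphrase of the broadcast cleanup fan-out in
// BlockManagerMasterEndpoint.
class MasterEndpointSketch {
  // The stale entry for executor A survives here because removeExecutor
  // was never called for it.
  val blockManagerInfo = mutable.Map.empty[String, BlockManagerInfo]

  def removeBroadcast(broadcastId: Long): Unit = {
    val removeMsg = RemoveBroadcast(broadcastId)
    blockManagerInfo.values.foreach { bm =>
      // The RPC to executor A's long-closed channel fails with
      // java.nio.channels.ClosedChannelException, producing the
      // driver-log errors quoted above.
      bm.slaveEndpoint.ask(removeMsg)
    }
  }
}
{code}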