Sumeet created SPARK-34949:
------------------------------
Summary: Executor.reportHeartBeat reregisters blockManager even
when Executor is shutting down
Key: SPARK-34949
URL: https://issues.apache.org/jira/browse/SPARK-34949
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.2.0
Environment: Resource Manager: K8s
Reporter: Sumeet
*Problem:*
I was testing Dynamic Allocation on K8s with about 300 executors. When the
executors were torn down due to
"spark.dynamicAllocation.executorIdleTimeout", all the executor pods were
removed from K8s. However, the "Executors" tab in the Spark UI still listed
some executors as alive.
[spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100]
also returned a value greater than 1.
*Cause:*
* "CoarseGrainedSchedulerBackend" issues RemoveExecutor on an
"executorEndpoint" and publishes "SparkListenerExecutorRemoved" on the
"listenerBus"
* "CoarseGrainedExecutorBackend" starts the executor shutdown
* "HeartbeatReceiver" picks up the "SparkListenerExecutorRemoved" event and
removes the executor from "executorLastSeen"
* In the meantime, the executor reports a Heartbeat. Now "HeartbeatReceiver"
cannot find the "executorId" in "executorLastSeen" and hence responds with
"HeartbeatResponse(reregisterBlockManager = true)"
* The Executor now calls "env.blockManager.reregister()" and reregisters
itself, so the driver again tracks an executor that is already shutting down
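
The sequence above can be replayed with a small single-threaded model. This is only a sketch, not Spark's code: the class and method names are hypothetical stand-ins for the components named in the list, and the driver's block manager registry is reduced to a plain set.

```python
# Simplified model of the race. Names mirror the report; this is not Spark's code.


class HeartbeatReceiver:
    """Driver side: remembers the executors it still expects heartbeats from."""

    def __init__(self):
        self.executor_last_seen = {}

    def add_executor(self, executor_id, now):
        self.executor_last_seen[executor_id] = now

    def on_executor_removed(self, executor_id):
        # Models the reaction to SparkListenerExecutorRemoved.
        self.executor_last_seen.pop(executor_id, None)

    def on_heartbeat(self, executor_id, now):
        """Return True when the executor should reregister its block manager."""
        if executor_id in self.executor_last_seen:
            self.executor_last_seen[executor_id] = now
            return False
        return True  # unknown executor


class Executor:
    def __init__(self, executor_id, receiver, driver_block_managers):
        self.executor_id = executor_id
        self.receiver = receiver
        # Assumption: the driver's registry is modelled as a shared set.
        self.driver_block_managers = driver_block_managers
        self.driver_block_managers.add(executor_id)
        receiver.add_executor(executor_id, now=0)

    def report_heartbeat(self, now):
        # No shutdown check here, which is the behaviour described in the report.
        if self.receiver.on_heartbeat(self.executor_id, now):
            self.driver_block_managers.add(self.executor_id)


def replay_race():
    receiver = HeartbeatReceiver()
    block_managers = set()
    executor = Executor("exec-1", receiver, block_managers)

    # The driver removes the executor; the executor begins shutting down.
    block_managers.discard("exec-1")
    receiver.on_executor_removed("exec-1")

    # A heartbeat that was already scheduled fires afterwards.
    executor.report_heartbeat(now=10)
    return block_managers
```

In this model, replay_race() returns a registry that still contains "exec-1" even though the executor was removed, which matches the stale entries seen in the UI.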
*Proposed Solution:*
The "reportHeartBeat" method is not aware that the Executor is shutting
down. It should check "executorShutdown" before reregistering.
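
A minimal sketch of that guard, written as a Python model and not as a patch to Spark. The "send_heartbeat" and "reregister" callables are hypothetical stand-ins for the RPC to "HeartbeatReceiver" and for "env.blockManager.reregister()", and a threading.Event stands in for the "executorShutdown" flag.

```python
import threading


class GuardedExecutor:
    """Model of an executor whose heartbeat path respects the shutdown flag."""

    def __init__(self, send_heartbeat, reregister):
        # send_heartbeat() -> bool: True if the driver asks for reregistration.
        self._send_heartbeat = send_heartbeat
        # reregister(): registers the block manager with the driver again.
        self._reregister = reregister
        # Stand-in for the "executorShutdown" flag.
        self.executor_shutdown = threading.Event()

    def stop(self):
        # Mark the executor as shutting down before any teardown work.
        self.executor_shutdown.set()

    def report_heartbeat(self):
        wants_reregister = self._send_heartbeat()
        # Only reregister while the executor is still meant to be alive.
        if wants_reregister and not self.executor_shutdown.is_set():
            self._reregister()
```

With this check, a heartbeat response that arrives after stop() has been called is ignored instead of bringing the block manager back.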