Ngone51 commented on code in PR #38876:
URL: https://github.com/apache/spark/pull/38876#discussion_r1040544977
##########
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala:
##########
@@ -583,7 +586,12 @@ class BlockManagerMasterEndpoint(
val time = System.currentTimeMillis()
executorIdToLocalDirs.put(id.executorId, localDirs)
- if (!blockManagerInfo.contains(id)) {
+ // SPARK-41360: For the block manager re-registration, we should only
allow it when
+ // the executor is recognized as active by the scheduler backend.
Otherwise, this kind
+ // of re-registration from the terminating/stopped executor is meaningless
and harmful.
+ lazy val isExecutorAlive =
+
driverEndpoint.askSync[Boolean](CoarseGrainedClusterMessages.IsExecutorAlive(id.executorId))
+ if (!blockManagerInfo.contains(id) && (!isReRegister || isExecutorAlive)) {
Review Comment:
> Wont the driver not remove in case of a heartbeat expiry even though the
executor did not disconnect ?
It will. But in these cases, the driver could fail to kill the executor.
(@mridulm No worries, take care:))
##########
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala:
##########
@@ -583,7 +586,12 @@ class BlockManagerMasterEndpoint(
val time = System.currentTimeMillis()
executorIdToLocalDirs.put(id.executorId, localDirs)
- if (!blockManagerInfo.contains(id)) {
+ // SPARK-41360: For the block manager re-registration, we should only
allow it when
+ // the executor is recognized as active by the scheduler backend.
Otherwise, this kind
+ // of re-registration from the terminating/stopped executor is meaningless
and harmful.
+ lazy val isExecutorAlive =
+
driverEndpoint.askSync[Boolean](CoarseGrainedClusterMessages.IsExecutorAlive(id.executorId))
+ if (!blockManagerInfo.contains(id) && (!isReRegister || isExecutorAlive)) {
Review Comment:
> Wont the driver not remove in case of a heartbeat expiry even though the
executor did not disconnect ? (for the same reasons as above - long gc pause,
network partition, etc)
It will. But in these cases, the driver could fail to kill the executor.
(@mridulm No worries, take care:))
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]