Ngone51 commented on code in PR #38876:
URL: https://github.com/apache/spark/pull/38876#discussion_r1039195161


##########
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala:
##########
@@ -583,7 +586,12 @@ class BlockManagerMasterEndpoint(
 
     val time = System.currentTimeMillis()
     executorIdToLocalDirs.put(id.executorId, localDirs)
-    if (!blockManagerInfo.contains(id)) {
+    // SPARK-41360: For the block manager re-registration, we should only 
allow it when
+    // the executor is recognized as active by the scheduler backend. 
Otherwise, this kind
+    // of re-registration from the terminating/stopped executor is meaningless 
and harmful.
+    lazy val isExecutorAlive =
+      
driverEndpoint.askSync[Boolean](CoarseGrainedClusterMessages.IsExecutorAlive(id.executorId))
+    if (!blockManagerInfo.contains(id) && (!isReRegister || isExecutorAlive)) {

Review Comment:
   > ShutdownHookManager.inShutdown should tell if it is in the process of 
shutting down - and prevent call to reregister ?
   
   Ok, i see.
   
   > For the other cases, lost due to long GC, lost due to network partitions, 
etc - they are legitimate candidates for registration.
   
   Note that there's a prerequisite of the re-registration in this PR that the 
executor should already be lost in the driver's view. In that case, block 
manager re-registration is meaningless since the executor won't reconnect to 
the driver.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to