[GitHub] [spark] ukby1234 commented on a diff in pull request #42155: [SPARK-44547][CORE] Ignore fallback storage for cached RDD migration

via GitHub Thu, 24 Aug 2023 15:53:21 -0700


ukby1234 commented on code in PR #42155:
URL: https://github.com/apache/spark/pull/42155#discussion_r1304947109



##########
core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:
##########
@@ -207,7 +207,7 @@ private[storage] class BlockManagerDecommissioner(
       logInfo("Attempting to migrate all RDD blocks")
       while (!stopped && !stoppedRDD) {
         // Validate if we have peers to migrate to. Otherwise, give up migration.
-        if (bm.getPeers(false).isEmpty) {
+        if (!bm.getPeers(false).exists(x => x != FallbackStorage.FALLBACK_BLOCK_MANAGER_ID)) {

Review Comment:
   I will open a separate PR to address the migration to S3. The main focus 
of this PR is to fix bugs that cause decommissioned executors to stay alive 
forever and, in some cases, create a deadlock. The deadlock occurs in the 
following situation: 
   1. All executors are marked as decommissioned, and there is a cap on the 
number of executors.
   2. The migration of RDD blocks never finishes because of the fallback 
storage: the fallback entry keeps the peer list non-empty, so the loop above 
never gives up.
   3. New tasks are launched, but no executor is eligible to run them because 
all of them are decommissioned, and no new executors can be launched because 
the maximum has been reached.
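
   To make the changed condition concrete, here is a minimal, self-contained 
sketch of the two peer checks. It does not use Spark's classes: `PeerId`, 
`PeerCheck`, `hasPeersOld`, and `hasPeersNew` are hypothetical names chosen 
for illustration, and `FallbackId` is only a stand-in for 
`FallbackStorage.FALLBACK_BLOCK_MANAGER_ID`.

```scala
// Hypothetical stand-in for Spark's BlockManagerId; not the real class.
final case class PeerId(executorId: String, host: String, port: Int)

object PeerCheck {
  // Stand-in for FallbackStorage.FALLBACK_BLOCK_MANAGER_ID.
  // The field values here are illustrative assumptions.
  val FallbackId: PeerId = PeerId("fallback", "remote", 7337)

  // Old condition (negated): any entry in the peer list, including the
  // fallback entry, counts as a migration target.
  def hasPeersOld(peers: Seq[PeerId]): Boolean = peers.nonEmpty

  // New condition (negated): at least one peer other than the fallback
  // entry must exist for RDD block migration to continue.
  def hasPeersNew(peers: Seq[PeerId]): Boolean = peers.exists(_ != FallbackId)
}
```

   With a peer list that contains only the fallback entry, `hasPeersOld` 
returns true, so the loop would keep retrying, while `hasPeersNew` returns 
false, so the loop would give up RDD migration and let the executor exit.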



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

