ukby1234 commented on code in PR #42155:
URL: https://github.com/apache/spark/pull/42155#discussion_r1304947109
##########
core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:
##########
@@ -207,7 +207,7 @@ private[storage] class BlockManagerDecommissioner(
logInfo("Attempting to migrate all RDD blocks")
while (!stopped && !stoppedRDD) {
// Validate if we have peers to migrate to. Otherwise, give up migration.
- if (bm.getPeers(false).isEmpty) {
+ if (!bm.getPeers(false).exists(x => x != FallbackStorage.FALLBACK_BLOCK_MANAGER_ID)) {
Review Comment:
I will open a separate PR to address the migration to S3. The main focus of
this PR is to fix bugs that cause decommissioned executors to stay alive
forever and, in some cases, create a deadlock. The deadlock occurs in the
following situation:
1. All executors are marked as decommissioned, and there is a cap on the
number of executors.
2. The migration of RDD blocks never finishes, because the fallback storage
keeps the peer list non-empty and the loop therefore never gives up.
3. New tasks are launched, but no executor is eligible to run them because
all of them are decommissioning, and no new executors can be launched because
the maximum has been reached.
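
To illustrate why the old check never gives up, here is a minimal sketch of
the two conditions. It is not Spark code: `PeerId`, `PeerCheck`, `FallbackId`,
`hasPeersOld` and `hasPeersNew` are hypothetical stand-ins for
`BlockManagerId`, `FallbackStorage.FALLBACK_BLOCK_MANAGER_ID` and the two
versions of the `if` condition in the diff above. It assumes the fallback
storage appears as an entry in the list returned by `bm.getPeers(false)`.

```scala
// Hypothetical stand-in for Spark's BlockManagerId; not the real class.
final case class PeerId(executorId: String)

object PeerCheck {
  // Hypothetical stand-in for FallbackStorage.FALLBACK_BLOCK_MANAGER_ID.
  val FallbackId: PeerId = PeerId("fallback")

  // Old condition: any entry counts as a peer, including the fallback entry.
  def hasPeersOld(peers: Seq[PeerId]): Boolean = peers.nonEmpty

  // New condition: at least one entry must be something other than the fallback.
  def hasPeersNew(peers: Seq[PeerId]): Boolean = peers.exists(_ != FallbackId)
}
```

With a peer list that contains only the fallback entry, `hasPeersOld` returns
true, so the migration loop keeps running, while `hasPeersNew` returns false,
so the loop can give up on migration.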
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]