[PR] HDFS-17722. DataNode stuck decommissioning on standby NameNode due to excess replica timing race [hadoop]

via GitHub Sat, 07 Mar 2026 16:02:17 -0800


deepujain opened a new pull request, #8308:
URL: https://github.com/apache/hadoop/pull/8308


   ### Summary
   On a standby NameNode, a DataNode can get stuck in `DECOMMISSION_INPROGRESS` 
indefinitely when a timing race causes a new replica (created during 
re-replication) to be marked as **excess** instead of **live**. The standby's 
decommission monitor then sees too few "live" replicas and never considers the 
block sufficient, so decommission never completes. This branch merges 
[apache/hadoop#8295](https://github.com/apache/hadoop/pull/8295) with current 
trunk so the fix is up to date.
   
   ### Change
   - **DatanodeAdminManager.isSufficient()**: For non–under-construction 
blocks, count **excess** replicas together with live replicas when deciding if 
the block is sufficiently replicated for decommission. Retain the existing 
guard so decommission does not proceed when there are zero live replicas 
(`hasMinStorage(block, numLive)`).
   - **TestDatanodeAdminManagerIsSufficient**: New unit tests for the 
sufficiency logic (excess replicas count toward sufficiency, zero live 
replicas blocks decommission, etc.).
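
To make the intended decision easier to review, here is a minimal sketch of the sufficiency check as described above. It is a simplified model, not the actual `DatanodeAdminManager` code: the class name, the integer parameters, the `minReplication` argument, and the "live only" variant standing in for the pre-change behavior are all assumptions made for illustration.

```java
// Simplified model of the sufficiency decision described in this PR.
// All names and signatures are hypothetical; this is not the Hadoop source.
final class DecommissionSufficiencySketch {

    /** Guard: the block must have at least minReplication live replicas. */
    static boolean hasMinStorage(int numLive, int minReplication) {
        return numLive >= minReplication;
    }

    /** Assumed pre-change behavior: only live replicas count. */
    static boolean isSufficientLiveOnly(int numExpected, int numLive) {
        return numLive >= numExpected;
    }

    /** Described behavior: live plus excess replicas count, behind the live-replica guard. */
    static boolean isSufficient(int numExpected, int numLive,
                                int numExcess, int minReplication) {
        if (!hasMinStorage(numLive, minReplication)) {
            return false; // never proceed with too few live replicas
        }
        return numLive + numExcess >= numExpected;
    }

    public static void main(String[] args) {
        // Race case: the new replica was marked excess instead of live.
        System.out.println(isSufficientLiveOnly(3, 2)); // false
        System.out.println(isSufficient(3, 2, 1, 1));   // true
        // Guard case: excess replicas alone are not enough.
        System.out.println(isSufficient(3, 0, 3, 1));   // false
    }
}
```

In the race case (expected 3, live 2, excess 1), counting only live replicas never reaches sufficiency, while counting live plus excess does. With zero live replicas the guard still blocks decommission regardless of the excess count.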
   
   ### JIRA
   Fixes HDFS-17722


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

