[ https://issues.apache.org/jira/browse/HDFS-17569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18022290#comment-18022290 ]

ASF GitHub Bot commented on HDFS-17569:
---------------------------------------

github-actions[bot] closed pull request #6924: HDFS-17569 Change Code Logic for 
Generating Block Reconstruction Work
URL: https://github.com/apache/hadoop/pull/6924




> Setup Effective Work Number when Generating Block Reconstruction Work
> ---------------------------------------------------------------------
>
>                 Key: HDFS-17569
>                 URL: https://issues.apache.org/jira/browse/HDFS-17569
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: wuchang
>            Priority: Major
>              Labels: pull-request-available
>
> h1. Description of PR
> The {{RedundancyMonitor}} is a daemon thread that sleeps *3s* between ticks.
> In {{computeBlockReconstructionWork(int blocksToProcess)}}, it first calls 
> {{chooseLowRedundancyBlocks()}} to find candidate low-redundancy blocks 
> (at most {{blocksToProcess}} of them) and then calls 
> {{computeReconstructionWorkForBlocks()}} to compute the reconstruction work.
> But in some cases, the candidate low-redundancy blocks are skipped during 
> reconstruction scheduling for various reasons (source unavailable, target 
> not found, validation failed, etc.), while other lower-priority blocks that 
> could be reconstructed have to wait many more rounds before being scheduled.
> h1. How it happened in my case
> In my case, I have a 7-datanode cluster ({{a1 ~ a5}}, {{b1 ~ b2}}), and I 
> want to add a new datanode {{b3}} while at the same time decommissioning 
> {{a1 ~ a5}}.
> I found that the decommission took one week to finish (fewer than 10000 
> blocks on each node), and I saw the following logs:
> {code:java}
> 2024-07-02 03:14:48,166 DEBUG 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block 
> blk_1073806598_65774 cannot be reconstructed from any node 
> 2024-07-02 03:14:48,166 DEBUG 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block 
> blk_1073806599_65775 cannot be reconstructed from any node 
> 2024-07-02 03:14:48,166 DEBUG 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block 
> blk_1073806600_65776 cannot be reconstructed from any node 
> 2024-07-02 03:14:48,166 DEBUG 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block 
> blk_1073806603_65779 cannot be reconstructed from any node{code}
> These blocks cannot be scheduled for reconstruction, yet they use up the 
> per-tick quota ({{blocksToProcess}}), delaying the scheduling of replicas 
> on the decommissioning nodes; as a result, the decommission has a very 
> long tail.
>  
> h1. My Solution:
> So my solution is: when we encounter low-redundancy blocks that must be 
> skipped for reconstruction, fast-forward within the current tick and check 
> other low-redundancy blocks so that reconstruction can be scheduled for 
> them instead.
> Of course, the total number of blocks successfully scheduled for 
> reconstruction is still bounded by the parameter {{blocksToProcess}}.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
