[
https://issues.apache.org/jira/browse/HDFS-17569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18021988#comment-18021988
]
ASF GitHub Bot commented on HDFS-17569:
---------------------------------------
github-actions[bot] commented on PR #6924:
URL: https://github.com/apache/hadoop/pull/6924#issuecomment-3321969153
We're closing this stale PR because it has been open for 100 days with no
activity. This isn't a judgement on the merit of the PR in any way. It's just a
way of keeping the PR queue manageable.
If you feel like this was a mistake, or you would like to continue working
on it, please feel free to re-open it and ask for a committer to remove the
stale tag and review again.
Thanks all for your contribution.
> Setup Effective Work Number when Generating Block Reconstruction Work
> ---------------------------------------------------------------------
>
> Key: HDFS-17569
> URL: https://issues.apache.org/jira/browse/HDFS-17569
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: wuchang
> Priority: Major
> Labels: pull-request-available
>
> h1. Description of PR
> The {{RedundancyMonitor}} is a daemon that sleeps 3s between ticks.
> In {{computeBlockReconstructionWork(int blocksToProcess)}}, it first uses
> {{chooseLowRedundancyBlocks()}} to find candidate low-redundancy blocks (at
> most {{blocksToProcess}} of them) and then uses
> {{computeReconstructionWorkForBlocks()}} to compute the reconstruction work.
> But in some cases the candidate low-redundancy blocks are skipped by
> reconstruction scheduling for various reasons (source unavailable, target
> not found, validation failed, etc.), while other lower-priority blocks that
> are eligible for reconstruction have to wait many more rounds before being scheduled.
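> To make the quota behaviour concrete, below is a simplified sketch of the current flow (the method names follow the real {{BlockManager}}, but the body is only illustrative pseudo-code, not the actual Hadoop implementation):
> {code:java}
> // Simplified, illustrative sketch -- NOT the actual BlockManager code.
> int computeBlockReconstructionWork(int blocksToProcess) {
>   // Step 1: pull up to blocksToProcess candidates out of the
>   // low-redundancy queues, highest priority first.
>   List<List<BlockInfo>> blocksToReconstruct =
>       neededReconstruction.chooseLowRedundancyBlocks(blocksToProcess);
>
>   // Step 2: try to build reconstruction work for each candidate.
>   // A candidate that gets skipped here (no source, no target,
>   // validation failed, ...) has already been counted against
>   // blocksToProcess in step 1, so that part of this tick's quota
>   // is effectively wasted.
>   return computeReconstructionWorkForBlocks(blocksToReconstruct);
> }
> {code}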
> h1. How it happened in my case
> In my case, I have a 7-datanode cluster ({{a1 ~ a5}}, {{b1 ~ b2}}), and I
> want to add a new datanode {{b3}} while decommissioning {{a1 ~ a5}} at the
> same time.
> I found that the decommission takes one week to finish (fewer than 10000
> blocks on each node), and I see the following logs:
> {code:java}
> 2024-07-02 03:14:48,166 DEBUG
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block
> blk_1073806598_65774 cannot be reconstructed from any node
> 2024-07-02 03:14:48,166 DEBUG
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block
> blk_1073806599_65775 cannot be reconstructed from any node
> 2024-07-02 03:14:48,166 DEBUG
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block
> blk_1073806600_65776 cannot be reconstructed from any node
> 2024-07-02 03:14:48,166 DEBUG
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block
> blk_1073806603_65779 cannot be reconstructed from any node{code}
> These blocks cannot be scheduled for reconstruction, yet they use up the
> quota ({{blocksToProcess}}) in each tick and delay the replicas on the
> decommissioning nodes from being scheduled for reconstruction, so the
> decommission becomes a very long tail.
>
> h1. My Solution:
> So my solution is: when we encounter low-redundancy blocks that are skipped
> for reconstruction, fast-forward within the current tick and check other
> low-redundancy blocks so that reconstruction can be scheduled for them; a
> sketch of the idea follows below.
> Of course, the total number of blocks successfully scheduled for
> reconstruction is still restricted by the parameter {{blocksToProcess}}.
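> As an illustration of the idea only (the loop below is a hypothetical sketch, not the actual patch), the quota could be charged per successfully scheduled block instead of per chosen candidate:
> {code:java}
> // Illustrative sketch only -- not the actual patch. Assumes
> // chooseLowRedundancyBlocks() advances its internal bookmark, so a
> // block skipped in this tick is not returned again immediately.
> int computeBlockReconstructionWork(int blocksToProcess) {
>   int scheduled = 0;
>   while (scheduled < blocksToProcess) {
>     // Fast-forward: ask only for the remaining quota.
>     List<List<BlockInfo>> candidates =
>         neededReconstruction.chooseLowRedundancyBlocks(
>             blocksToProcess - scheduled);
>     int candidateCount = candidates.stream().mapToInt(List::size).sum();
>     if (candidateCount == 0) {
>       break; // queues exhausted for this tick
>     }
>     // Only blocks for which reconstruction work was really created
>     // count toward the quota; skipped blocks no longer consume it.
>     scheduled += computeReconstructionWorkForBlocks(candidates);
>   }
>   return scheduled;
> }
> {code}
> With a shape like this, the "cannot be reconstructed from any node" blocks in the logs above would no longer exhaust the tick's quota, and the replicas on the decommissioning nodes could be scheduled within the same tick.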
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]