[ https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551381#comment-17551381 ]
caozhiqiang commented on HDFS-16613:
------------------------------------

[~hadachi], in my cluster, dfs.namenode.replication.max-streams-hard-limit=512 and dfs.namenode.replication.work.multiplier.per.iteration=20. The data flow is as follows:
# Choose the blocks to be reconstructed from neededReconstruction. This step uses dfs.namenode.replication.work.multiplier.per.iteration to limit the number of blocks processed per iteration.
# *Choose the source datanode. This step uses dfs.namenode.replication.max-streams-hard-limit to limit the number of streams.*
# Choose the target datanode.
# Add the task to the datanode.
# Put the blocks to be replicated into pendingReconstruction. If blocks in pendingReconstruction time out, they are put back into neededReconstruction and processed again. *This step uses dfs.namenode.reconstruction.pending.timeout-sec as the timeout interval.*
# *Send the command to the DN in the heartbeat response. Originally, dfs.namenode.decommission.max-streams was used to limit the task number.*

Step 1 has no performance bottleneck; the bottlenecks are in steps 2, 5 and 6. So we should increase the value of dfs.namenode.replication.max-streams-hard-limit and decrease the value of dfs.namenode.reconstruction.pending.timeout-sec. For step 6, we should use dfs.namenode.replication.max-streams-hard-limit to limit the task number:
{code:java}
// DatanodeManager::handleHeartbeat
if (nodeinfo.isDecommissionInProgress()) {
  maxTransfers = blockManager.getReplicationStreamsHardLimit()
      - xmitsInProgress;
} else {
  maxTransfers = blockManager.getMaxReplicationStreams()
      - xmitsInProgress;
}
{code}
The graphs below monitor the under-replicated and pending-replicated block metrics, which show the performance bottleneck: many blocks timed out in pendingReconstruction and were repeatedly put back into neededReconstruction. The first graph is before the optimization and the second is after. Please help to check this process, thank you.
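As a minimal sketch (not the actual DatanodeManager code), the heartbeat task budget from step 6 can be illustrated as follows. The class name, the clamp to zero, and the dfs.namenode.replication.max-streams value of 100 are assumptions for illustration; 512 is the hard-limit value from the cluster described above.

```java
// Hypothetical sketch of the maxTransfers computation described above;
// not the actual HDFS source. Config values are assumed.
public class MaxTransfersSketch {
  // dfs.namenode.replication.max-streams-hard-limit (value from the comment)
  static final int HARD_LIMIT = 512;
  // dfs.namenode.replication.max-streams (assumed value)
  static final int MAX_STREAMS = 100;

  static int maxTransfers(boolean decommissioning, int xmitsInProgress) {
    // A decommissioning DN is budgeted against the hard limit,
    // a normal DN against the ordinary max-streams limit.
    int limit = decommissioning ? HARD_LIMIT : MAX_STREAMS;
    // Clamp at zero so a saturated DN receives no new tasks (sketch behavior).
    return Math.max(0, limit - xmitsInProgress);
  }

  public static void main(String[] args) {
    // A decommissioning DN with 200 in-flight transfers still gets 312 tasks;
    // under the normal limit the same DN would get none.
    System.out.println(maxTransfers(true, 200));
    System.out.println(maxTransfers(false, 200));
  }
}
```

This is why raising the hard limit speeds up decommissioning without loosening the cluster-wide limit applied to normal datanodes.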
!image-2022-06-08-11-41-11-127.png|width=932,height=190! !image-2022-06-08-11-38-29-664.png|width=931,height=175!

> EC: Improve performance of decommissioning dn with many ec blocks
> -----------------------------------------------------------------
>
>                 Key: HDFS-16613
>                 URL: https://issues.apache.org/jira/browse/HDFS-16613
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: ec, erasure-coding, namenode
>    Affects Versions: 3.4.0
>            Reporter: caozhiqiang
>            Assignee: caozhiqiang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2022-06-07-11-46-42-389.png, image-2022-06-07-17-42-16-075.png, image-2022-06-07-17-45-45-316.png, image-2022-06-07-17-51-04-876.png, image-2022-06-07-17-55-40-203.png, image-2022-06-08-11-38-29-664.png, image-2022-06-08-11-41-11-127.png
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In an HDFS cluster with many EC blocks, decommissioning a DN is very slow. The reason is that, unlike replicated blocks, which can be copied from any DN holding a replica, an EC block has to be replicated from the decommissioning DN itself.
> The configurations dfs.namenode.replication.max-streams and dfs.namenode.replication.max-streams-hard-limit limit the replication speed, but increasing them puts the whole cluster's network at risk. So a new configuration should be added to limit the decommissioning DN, distinct from the cluster-wide max-streams limit.