[
https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551381#comment-17551381
]
caozhiqiang edited comment on HDFS-16613 at 6/8/22 3:58 AM:
------------------------------------------------------------
[~hadachi], in my cluster,
dfs.namenode.replication.max-streams-hard-limit=512 and
dfs.namenode.replication.work.multiplier.per.iteration=20.
The processing flow is as follows:
# Choose the blocks to be reconstructed from neededReconstruction. This
step uses dfs.namenode.replication.work.multiplier.per.iteration to limit
the number of blocks processed per iteration.
# *Choose the source datanode. This step uses
dfs.namenode.replication.max-streams-hard-limit to limit the number of
streams per datanode.*
# Choose the target datanode.
# Add the task to the datanode.
# The blocks to be replicated are put into pendingReconstruction. If blocks in
pendingReconstruction time out, they are put back into neededReconstruction
and processed again. *This step uses
dfs.namenode.reconstruction.pending.timeout-sec as the timeout interval.*
# *Send replication commands to the dn in the heartbeat response. Originally,
dfs.namenode.decommission.max-streams limits the task number here.*
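To make the two limits in steps 1 and 2 concrete, here is a minimal sketch (not Hadoop code; the class and method names are illustrative) of how the per-iteration multiplier and the hard limit bound the work per scheduling round, using my cluster's values as assumptions:
{code:java}
// Illustrative sketch only, using the values from this cluster.
public class ReconstructionLimits {
  // dfs.namenode.replication.work.multiplier.per.iteration (assumed 20)
  static final int WORK_MULTIPLIER = 20;
  // dfs.namenode.replication.max-streams-hard-limit (assumed 512)
  static final int HARD_LIMIT = 512;

  // Step 1: blocks picked per 3-second iteration scale with live datanodes.
  static int blocksToProcess(int liveDatanodes) {
    return liveDatanodes * WORK_MULTIPLIER;
  }

  // Steps 2 and 6: replication streams a datanode may still accept.
  static int maxTransfers(int xmitsInProgress) {
    return Math.max(0, HARD_LIMIT - xmitsInProgress);
  }

  public static void main(String[] args) {
    System.out.println(blocksToProcess(100)); // 2000 blocks per iteration
    System.out.println(maxTransfers(500));    // only 12 streams left
  }
}
{code}
So step 1 easily supplies thousands of blocks per iteration, while a datanode near the stream hard limit can accept almost none, which is why the bottleneck shows up in the later steps.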
Firstly, step 1 is not a performance bottleneck, and its processing
interval is 3 seconds.
The performance bottleneck is in steps 2, 5 and 6. So we should increase the
value of dfs.namenode.replication.max-streams-hard-limit and decrease the value
of dfs.namenode.reconstruction.pending.timeout-sec. For step 6, we should
change the code to use dfs.namenode.replication.max-streams-hard-limit to limit
the task number:
{code:java}
// DatanodeManager::handleHeartbeat
if (nodeinfo.isDecommissionInProgress()) {
  maxTransfers = blockManager.getReplicationStreamsHardLimit()
      - xmitsInProgress;
} else {
  maxTransfers = blockManager.getMaxReplicationStreams()
      - xmitsInProgress;
}
{code}
*In other words, we should move blocks from pendingReconstruction back to
neededReconstruction at a shorter interval (step 5), and send more replication
tasks to the datanodes (steps 2 and 6).*
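The step-5 requeue behavior can be sketched like this (a simplified illustration, not the actual Hadoop implementation; the class and field names are made up):
{code:java}
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Sketch of step 5: blocks whose pending replication exceeds the timeout
// are moved back to the needed queue on a periodic scan.
public class PendingTimeoutSketch {
  // dfs.namenode.reconstruction.pending.timeout-sec in millis (assumed value)
  static final long TIMEOUT_MS = 300_000;

  final Map<String, Long> pending = new HashMap<>(); // block -> enqueue time
  final Queue<String> needed = new ArrayDeque<>();   // neededReconstruction

  // A replication task was scheduled for this block.
  void schedule(String block, long nowMs) {
    pending.put(block, nowMs);
  }

  // Periodic scan: timed-out blocks go back to neededReconstruction.
  void requeueTimedOut(long nowMs) {
    pending.entrySet().removeIf(e -> {
      if (nowMs - e.getValue() >= TIMEOUT_MS) {
        needed.add(e.getKey());
        return true;
      }
      return false;
    });
  }
}
{code}
With a large timeout, a block that never got a command sent (step 6 bottleneck) sits idle in pending for the full interval before it can be rescheduled, which is why shortening the timeout helps.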
The graphs below show the under_replicated_blocks and pending_replicated_blocks
metrics monitored on the namenode, which illustrate the performance bottleneck:
a lot of blocks time out in pendingReconstruction and are put back into
neededReconstruction repeatedly. The first graph is before the optimization and
the second is after it.
Please help to check this process, thank you.
!image-2022-06-08-11-41-11-127.png|width=932,height=190!
!image-2022-06-08-11-38-29-664.png|width=931,height=175!
> EC: Improve performance of decommissioning dn with many ec blocks
> -----------------------------------------------------------------
>
> Key: HDFS-16613
> URL: https://issues.apache.org/jira/browse/HDFS-16613
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: ec, erasure-coding, namenode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2022-06-07-11-46-42-389.png,
> image-2022-06-07-17-42-16-075.png, image-2022-06-07-17-45-45-316.png,
> image-2022-06-07-17-51-04-876.png, image-2022-06-07-17-55-40-203.png,
> image-2022-06-08-11-38-29-664.png, image-2022-06-08-11-41-11-127.png
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> In an hdfs cluster with a lot of EC blocks, decommissioning a dn is very slow.
> The reason is that, unlike replicated blocks, which can be copied from any dn
> that holds a replica of the block, the ec blocks have to be replicated from
> the decommissioning dn.
> The configurations dfs.namenode.replication.max-streams and
> dfs.namenode.replication.max-streams-hard-limit limit the replication
> speed, but increasing them would create risk for the whole cluster's
> network. So a new configuration should be added to limit the
> decommissioning dn, distinguished from the cluster-wide max-streams limit.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)