[
https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551381#comment-17551381
]
caozhiqiang edited comment on HDFS-16613 at 6/8/22 3:58 AM:
------------------------------------------------------------
[~hadachi], in my cluster,
dfs.namenode.replication.max-streams-hard-limit=512 and
dfs.namenode.replication.work.multiplier.per.iteration=20.
The processing flow is as follows:
# Choose the blocks to be reconstructed from neededReconstruction. This
step uses dfs.namenode.replication.work.multiplier.per.iteration to limit
the number of blocks processed per iteration.
# *Choose the source datanode. This step uses
dfs.namenode.replication.max-streams-hard-limit to limit the number of
streams per datanode.*
# Choose the target datanode.
# Add the task to the datanode.
# The blocks to be replicated are put into pendingReconstruction. If blocks in
pendingReconstruction time out, they are put back into neededReconstruction
and processed again. *This step uses
dfs.namenode.reconstruction.pending.timeout-sec as the timeout interval.*
# *Send replication commands to the dn in the heartbeat response. Originally,
dfs.namenode.decommission.max-streams limits the task number here.*
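To make the two limits in steps 1 and 2 concrete, here is a minimal sketch (not Hadoop code; the class and method names are illustrative) of how the per-iteration multiplier and the hard limit bound the work per scheduling round, using my cluster's values as assumptions:
{code:java}
// Illustrative sketch only, using the values from this cluster.
public class ReconstructionLimits {
  // dfs.namenode.replication.work.multiplier.per.iteration (assumed 20)
  static final int WORK_MULTIPLIER = 20;
  // dfs.namenode.replication.max-streams-hard-limit (assumed 512)
  static final int HARD_LIMIT = 512;

  // Step 1: blocks picked per 3-second iteration scale with live datanodes.
  static int blocksToProcess(int liveDatanodes) {
    return liveDatanodes * WORK_MULTIPLIER;
  }

  // Steps 2 and 6: replication streams a datanode may still accept.
  static int maxTransfers(int xmitsInProgress) {
    return Math.max(0, HARD_LIMIT - xmitsInProgress);
  }

  public static void main(String[] args) {
    System.out.println(blocksToProcess(100)); // 2000 blocks per iteration
    System.out.println(maxTransfers(500));    // only 12 streams left
  }
}
{code}
So step 1 easily supplies thousands of blocks per iteration, while a datanode near the stream hard limit can accept almost none, which is why the bottleneck shows up in the later steps.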
Firstly, step 1 is not a performance bottleneck, and its processing
interval is 3 seconds.
The performance bottleneck is in steps 2, 5 and 6. So we should increase the
value of dfs.namenode.replication.max-streams-hard-limit and decrease the value
of dfs.namenode.reconstruction.pending.timeout-sec. For step 6, we should
change the code to use dfs.namenode.replication.max-streams-hard-limit to limit
the task number:
{code:java}
// DatanodeManager::handleHeartbeat
if (nodeinfo.isDecommissionInProgress()) {
  maxTransfers = blockManager.getReplicationStreamsHardLimit()
      - xmitsInProgress;
} else {
  maxTransfers = blockManager.getMaxReplicationStreams()
      - xmitsInProgress;
}
{code}
*In other words, we should move blocks from pendingReconstruction back to
neededReconstruction at a shorter interval (step 5), and send more replication
tasks to the datanodes (steps 2 and 6).*
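The step-5 requeue behavior can be sketched like this (a simplified illustration, not the actual Hadoop implementation; the class and field names are made up):
{code:java}
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Sketch of step 5: blocks whose pending replication exceeds the timeout
// are moved back to the needed queue on a periodic scan.
public class PendingTimeoutSketch {
  // dfs.namenode.reconstruction.pending.timeout-sec in millis (assumed value)
  static final long TIMEOUT_MS = 300_000;

  final Map<String, Long> pending = new HashMap<>(); // block -> enqueue time
  final Queue<String> needed = new ArrayDeque<>();   // neededReconstruction

  // A replication task was scheduled for this block.
  void schedule(String block, long nowMs) {
    pending.put(block, nowMs);
  }

  // Periodic scan: timed-out blocks go back to neededReconstruction.
  void requeueTimedOut(long nowMs) {
    pending.entrySet().removeIf(e -> {
      if (nowMs - e.getValue() >= TIMEOUT_MS) {
        needed.add(e.getKey());
        return true;
      }
      return false;
    });
  }
}
{code}
With a large timeout, a block that never got a command sent (step 6 bottleneck) sits idle in pending for the full interval before it can be rescheduled, which is why shortening the timeout helps.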
The graphs below show the under_replicated_blocks and pending_replicated_blocks
metrics monitored on the namenode, which illustrate the performance bottleneck:
a lot of blocks time out in pendingReconstruction and are put back into
neededReconstruction repeatedly. The first graph is before the optimization and
the second is after it.
Please help to check this process, thank you.
!image-2022-06-08-11-41-11-127.png|width=932,height=190!
!image-2022-06-08-11-38-29-664.png|width=931,height=175!
> EC: Improve performance of decommissioning dn with many ec blocks
> -----------------------------------------------------------------
>
> Key: HDFS-16613
> URL: https://issues.apache.org/jira/browse/HDFS-16613
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: ec, erasure-coding, namenode
> Affects Versions: 3.4.0
> Reporter: caozhiqiang
> Assignee: caozhiqiang
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2022-06-07-11-46-42-389.png,
> image-2022-06-07-17-42-16-075.png, image-2022-06-07-17-45-45-316.png,
> image-2022-06-07-17-51-04-876.png, image-2022-06-07-17-55-40-203.png,
> image-2022-06-08-11-38-29-664.png, image-2022-06-08-11-41-11-127.png
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> In an hdfs cluster with a lot of EC blocks, decommissioning a dn is very slow.
> The reason is that, unlike replicated blocks, which can be copied from any dn
> that holds a replica of the block, the ec blocks have to be replicated from
> the decommissioning dn.
> The configurations dfs.namenode.replication.max-streams and
> dfs.namenode.replication.max-streams-hard-limit limit the replication
> speed, but increasing them would create risk for the whole cluster's
> network. So a new configuration should be added to limit the
> decommissioning dn, distinguished from the cluster-wide max-streams limit.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)