[ https://issues.apache.org/jira/browse/HDFS-17036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liuguanghua updated HDFS-17036:
-------------------------------
    Description: 
Consider the following scenario:

(1) The HDFS cluster is sufficiently large.

Namenode config (see the configuration sketch after this list):

DFS_NAMENODE_REPLICATION_WORK_MULTIPLIER_PER_ITERATION -> 200

DFS_NAMENODE_REPLICATION_MAX_STREAMS_KEY -> 100

DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_KEY -> 200

Datanode:

Datanodes have large disks, so a single datanode can hold millions or even
tens of millions of blocks, and possibly more in an archive cluster.

(2) The cluster administrator may do one of the following actions:

2.1 Increasing the replication factor on a large directory, e.g. one holding 2 PB

2.2 Decommissioning some nodes from the cluster

(3) The datanodes in the cluster are under heavy read and write load.
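
For reference, a minimal sketch of how the three settings above can be applied
to a Configuration. The DFSConfigKeys constants are the ones named in this
description; the values are just the scenario's.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hdfs.DFSConfigKeys;

    // Sketch only: applies the scenario's namenode settings programmatically.
    public class ScenarioConf {
      public static Configuration build() {
        Configuration conf = new Configuration();
        conf.setInt(DFSConfigKeys.DFS_NAMENODE_REPLICATION_WORK_MULTIPLIER_PER_ITERATION, 200);
        conf.setInt(DFSConfigKeys.DFS_NAMENODE_REPLICATION_MAX_STREAMS_KEY, 100);
        conf.setInt(DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_KEY, 200);
        return conf;
      }
    }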

 

Thus, the datanode receives transfer commands via heartbeats from the
namenode. When the datanode is under heavy load (reads and writes), the
SumOfActorCommandQueueLength metric can climb past 1k because commands are
consumed slowly under lock contention. For details, see
BPServiceActor.CommandProcessingThread.
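
As a simplified model of why that queue backs up (an illustration of the
producer/consumer shape only, not the actual Hadoop source): heartbeats
enqueue commands without blocking, while a single consumer thread has to
compete for a shared lock with the read/write path.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Simplified model of the BPServiceActor.CommandProcessingThread queue
    // (illustrative only, not the real implementation).
    public class CommandQueueModel {
      private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>(); // unbounded
      private final Object datasetLock = new Object(); // shared with read/write paths

      // Heartbeat side: offer() never blocks, so a slow consumer lets the
      // queue length (SumOfActorCommandQueueLength) grow without bound.
      void enqueue(Runnable command) {
        queue.offer(command);
      }

      // Consumer side: a single thread, serialized on the shared lock.
      void processLoop() throws InterruptedException {
        while (true) {
          Runnable cmd = queue.take();
          synchronized (datasetLock) { // heavy I/O contention stalls here
            cmd.run();
          }
        }
      }
    }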

 

The namenode distributes all replication work to datanodes through
heartbeats. All under-replicated blocks sit in neededReconstruction, and
every heartbeat distributes up to
liveDatanodes * DFS_NAMENODE_REPLICATION_WORK_MULTIPLIER_PER_ITERATION blocks
of work to the related datanodes (see the arithmetic sketch below). But, as
mentioned above, the datanodes are blocked executing the transfer-block
commands. After 5 minutes (the default timeout), entries in
pendingReconstruction time out, re-enter neededReconstruction, and then
re-enter pendingReconstruction. The datanodes therefore receive multiple
transfer commands for the same block, which leads to excess blocks.
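
To make the scale concrete, here is the per-round scheduling arithmetic under
the configuration above. The 1,000-node figure is an assumption for
illustration; the multiplication mirrors the blocksToProcess computation in
BlockManager (simplified, not the actual source).

    // Illustrative only: assumes 1,000 live datanodes; the multiplier is
    // DFS_NAMENODE_REPLICATION_WORK_MULTIPLIER_PER_ITERATION from above.
    public class SchedulingMath {
      public static void main(String[] args) {
        int liveDatanodes = 1000;  // assumed cluster size
        int workMultiplier = 200;  // replication work multiplier per iteration
        int blocksToProcess = liveDatanodes * workMultiplier;
        // 200,000 blocks enter pendingReconstruction per round; if datanodes
        // cannot drain their command queues within the 5-minute pending
        // timeout, the same blocks are rescheduled on top of these.
        System.out.println("blocksToProcess per round = " + blocksToProcess);
      }
    }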

As this cycle continues, the consequences become serious: the datanode
accumulates ever more transfer commands in its queue, wastes ever more disk
space, and produces ever more excess blocks.
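
A simplified model of the timeout cycle described above (illustrative only;
the real logic lives in PendingReconstructionBlocks and its monitor thread):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    // Sketch of how timed-out pending work is rescheduled, so a stalled
    // datanode keeps receiving transfer commands for the same block.
    public class PendingTimeoutModel {
      static final long TIMEOUT_MS = 5 * 60 * 1000; // the 5-minute default above

      final Map<String, Long> pendingReconstruction = new HashMap<>(); // block -> enqueue time
      final Deque<String> neededReconstruction = new ArrayDeque<>();

      // Periodic monitor: expired entries re-enter neededReconstruction,
      // from where they will be scheduled (and sent) again.
      void checkTimeouts(long now) {
        Iterator<Map.Entry<String, Long>> it = pendingReconstruction.entrySet().iterator();
        while (it.hasNext()) {
          Map.Entry<String, Long> e = it.next();
          if (now - e.getValue() > TIMEOUT_MS) {
            String block = e.getKey();
            it.remove();
            neededReconstruction.add(block);
          }
        }
      }
    }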

 

So I think we should limit the maximum size of pendingReconstruction.
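
One possible shape for such a limit (purely hypothetical; the cap value and
the names below are placeholders, not an agreed design): stop handing out new
reconstruction work once pendingReconstruction is full, so that timed-out
work drains instead of piling up.

    // Hypothetical sketch of the proposed cap, not code from a patch.
    public class PendingLimitSketch {
      // Would presumably come from a new configuration key.
      static final int MAX_PENDING_RECONSTRUCTION_BLOCKS = 100_000;

      // Clamp each round's work to the remaining pendingReconstruction headroom.
      static int blocksToSchedule(int liveDatanodes, int workMultiplier, int pendingSize) {
        int blocksToProcess = liveDatanodes * workMultiplier;
        int headroom = Math.max(0, MAX_PENDING_RECONSTRUCTION_BLOCKS - pendingSize);
        return Math.min(blocksToProcess, headroom);
      }
    }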

 

 

> Limit pendingReconstruction size
> --------------------------------
>
>                 Key: HDFS-17036
>                 URL: https://issues.apache.org/jira/browse/HDFS-17036
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs
>    Affects Versions: 3.4.0
>            Reporter: liuguanghua
>            Priority: Major
>


