[ 
https://issues.apache.org/jira/browse/HDFS-16989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-16989:
-----------------------------
    Description: 
Recently, we changed the replication factor of a directory holding 1.6 PB of data from 
2 to 3. There are 76 million blocks in this directory. After we executed the 
setrep command, the active namenode printed many logs like the following:
{code:java}
PendingReconstructionMonitor timed out blk_xxxx_260285131{code}
and many datanodes printed large numbers of duplicated logs like the following: 
{code:java}
2023-04-21 13:58:17,627 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(1.1.1.1:50010, 
datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075, 
infoSecurePort=0, ipcPort=8010, 
storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting 
thread to transfer BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131 
to 2.2.2.2:50010 
2023-04-21 14:21:21,296 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DataTransfer, at 1.1.1.1:50010: Transmitted 
BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131 
(numBytes=524384907) to /2.2.2.2:50010
14:34:19,134 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(1.1.1.1:50010, 
datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075, 
infoSecurePort=0, ipcPort=8010, 
storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting 
thread to transfer BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131 
to 2.2.2.2:50010
2023-04-21 14:37:58,207 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(1.1.1.1:50010, 
datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075, 
infoSecurePort=0, ipcPort=8010, 
storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting 
thread to transfer BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131 
to 2.2.2.2:50010
14:40:46,817 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(1.1.1.1:50010, 
datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075, 
infoSecurePort=0, ipcPort=8010, 
storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting 
thread to transfer BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131 
to 2.2.2.2:50010
 {code}
Curiously, the same block blk_1333463885_260285131 was transferred multiple 
times even though it had already been transmitted successfully. The excess transfer 
requests trigger many ReplicaAlreadyExistsException errors on the target datanode, 
because the replica has already been transmitted and is in the FINALIZED state.
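
The datanode-side rejection can be illustrated with a toy sketch (the class, method, and state names below are simplified stand-ins for illustration, not the actual FsDatasetImpl code): once a replica exists for a block ID, a second incoming transfer for the same block is refused.
{code:java}
import java.util.HashMap;
import java.util.Map;

// Simplified model of a datanode refusing a duplicate transfer: if a
// replica for the block ID already exists (e.g. FINALIZED), the incoming
// transfer is rejected, analogous to ReplicaAlreadyExistsException.
public class ReplicaStoreSketch {
    enum State { TEMPORARY, FINALIZED }

    private final Map<Long, State> replicas = new HashMap<>();

    // Returns true if the transfer is accepted, false if rejected as a duplicate.
    boolean receiveTransfer(long blockId) {
        if (replicas.containsKey(blockId)) {
            // The real datanode throws ReplicaAlreadyExistsException here.
            return false;
        }
        replicas.put(blockId, State.TEMPORARY);   // write in progress
        replicas.put(blockId, State.FINALIZED);   // transfer completes
        return true;
    }

    public static void main(String[] args) {
        ReplicaStoreSketch dn = new ReplicaStoreSketch();
        boolean first = dn.receiveTransfer(1333463885L);   // accepted
        boolean second = dn.receiveTransfer(1333463885L);  // duplicate, rejected
        if (!first || second) throw new AssertionError();
    }
}
{code}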

 

The root cause is in RedundancyMonitor#processPendingReconstructions and 
BlockManager#validateReconstructionWork.

1. RedundancyMonitor#computeDatanodeWork() generates transfer tasks from 
neededReconstruction via addTaskToDatanode, and puts the tasks into 
pendingReconstruction.

2. We set *dfs.namenode.replication.work.multiplier.per.iteration = 200,* and the 
cluster in question has 400 datanodes. So RedundancyMonitor may generate up to 80K 
block transfer tasks per iteration. After 
dfs.namenode.reconstruction.pending.timeout-sec (5 min), the requests in 
pendingReconstruction time out, and the PendingReconstructionMonitor thread 
moves the timed-out requests into timedOutItems.

3. RedundancyMonitor#processPendingReconstructions() puts the requests in 
timedOutItems back into neededReconstruction.

4. The monitor then sleeps for one interval: TimeUnit.MILLISECONDS.sleep(redundancyRecheckIntervalMs);

5. In the next iteration of the while loop, RedundancyMonitor#computeDatanodeWork() 
generates transfer tasks from neededReconstruction again, producing either a 
repeated task or a task with a different target node (due to the chooseTarget method).
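
The five steps above can be sketched as a toy simulation (names simplified and timings compressed; this is not the actual RedundancyMonitor code): because a timed-out item returns to neededReconstruction while the original transfer may still be in flight, the same block gets scheduled again on every iteration.
{code:java}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Toy model of the scheduling loop in steps 1-5. A block whose pending
// reconstruction "times out" goes back into neededReconstruction, so the
// next iteration schedules a duplicate transfer even though the first
// transfer may still complete successfully.
public class RedundancyLoopSketch {
    public static void main(String[] args) {
        // Step 2's sizing: multiplier * live datanodes = max tasks per iteration.
        int multiplier = 200, liveDatanodes = 400;
        int maxTasksPerIteration = multiplier * liveDatanodes;
        if (maxTasksPerIteration != 80_000) throw new AssertionError();

        ArrayDeque<Long> neededReconstruction = new ArrayDeque<>();
        List<Long> scheduled = new ArrayList<>();
        neededReconstruction.add(1333463885L);

        // Simulate 3 iterations in which every pending task "times out"
        // before the datanode reports the transfer as finished.
        for (int iter = 0; iter < 3; iter++) {
            Long block = neededReconstruction.poll();  // computeDatanodeWork()
            scheduled.add(block);                      // addTaskToDatanode
            neededReconstruction.add(block);           // timeout -> timedOutItems -> needed again
        }
        // The same block was scheduled 3 times: 3 duplicate transfer requests.
        if (scheduled.size() != 3) throw new AssertionError();
        System.out.println("times scheduled: " + scheduled.size());
    }
}
{code}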



> Large scale block transfer causes too many excess blocks
> --------------------------------------------------------
>
>                 Key: HDFS-16989
>                 URL: https://issues.apache.org/jira/browse/HDFS-16989
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.4.0, 3.3.5
>            Reporter: farmmamba
>            Priority: Critical
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
