[
https://issues.apache.org/jira/browse/HDFS-16989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18032301#comment-18032301
]
ASF GitHub Bot commented on HDFS-16989:
---------------------------------------
github-actions[bot] closed pull request #5593: HDFS-16989. Large scale block
transfer causes too many excess blocks
URL: https://github.com/apache/hadoop/pull/5593
> Large scale block transfer causes too many excess blocks
> --------------------------------------------------------
>
> Key: HDFS-16989
> URL: https://issues.apache.org/jira/browse/HDFS-16989
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.4.0, 3.3.5
> Reporter: farmmamba
> Priority: Critical
> Labels: pull-request-available
>
> Recently, we changed the replication factor of a directory holding 1.6 PB of
> data from 2 to 3. There are 76 million blocks in this directory. After we
> executed the setrep command, the active namenode printed lots of logs like
> the following:
> {code:java}
> PendingReconstructionMonitor timed out blk_xxxx_260285131{code}
> and many datanodes printed lots of duplicated logs like the following:
> {code:java}
> 2023-04-21 13:58:17,627 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(1.1.1.1:50010,
> datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075,
> infoSecurePort=0, ipcPort=8010,
> storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting
> thread to transfer
> BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131 to 2.2.2.2:50010
> 2023-04-21 14:21:21,296 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> DataTransfer, at 1.1.1.1:50010: Transmitted
> BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131
> (numBytes=524384907) to /2.2.2.2:50010
> 2023-04-21 14:34:19,134 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(1.1.1.1:50010,
> datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075,
> infoSecurePort=0, ipcPort=8010,
> storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting
> thread to transfer
> BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131 to 2.2.2.2:50010
> 2023-04-21 14:37:58,207 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(1.1.1.1:50010,
> datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075,
> infoSecurePort=0, ipcPort=8010,
> storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting
> thread to transfer
> BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131 to 2.2.2.2:50010
> 2023-04-21 14:40:46,817 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(1.1.1.1:50010,
> datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075,
> infoSecurePort=0, ipcPort=8010,
> storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting
> thread to transfer
> BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131 to 2.2.2.2:50010
> {code}
> Strangely, the same block blk_1333463885_260285131 was transferred multiple
> times even though it had already been transmitted successfully. These excess
> transfer requests trigger lots of ReplicaAlreadyExistsException on the
> target datanode, because the replica has already been transmitted and is in
> the FINALIZED state.
>
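> To make the failure mode concrete, here is a minimal, self-contained sketch
> (not Hadoop's actual code; the class and method names below are simplified
> stand-ins) of why a duplicate transfer is rejected on the target datanode:
> the replica already exists in FINALIZED state, so creating a new temporary
> replica for the incoming copy fails, which in real HDFS surfaces as
> ReplicaAlreadyExistsException:
> {code:java}
> // Simplified model of a datanode's replica map: a second transfer of a
> // block that is already FINALIZED is rejected.
> import java.util.HashMap;
> import java.util.Map;
>
> public class DuplicateTransferSketch {
>   enum ReplicaState { TEMPORARY, FINALIZED }
>
>   static final Map<Long, ReplicaState> replicaMap = new HashMap<>();
>
>   // Stand-in for the dataset's "create replica for incoming transfer" step.
>   static void createTemporary(long blockId) {
>     ReplicaState existing = replicaMap.get(blockId);
>     if (existing != null) {
>       // Real HDFS throws ReplicaAlreadyExistsException at this point.
>       throw new IllegalStateException("Block " + blockId
>           + " already exists in state " + existing);
>     }
>     replicaMap.put(blockId, ReplicaState.TEMPORARY);
>   }
>
>   public static void main(String[] args) {
>     long blockId = 1333463885L;
>     createTemporary(blockId);                        // first transfer starts
>     replicaMap.put(blockId, ReplicaState.FINALIZED); // first transfer completes
>     createTemporary(blockId);                        // duplicate transfer -> rejected
>   }
> }
> {code}
>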
> The root cause lies in RedundancyMonitor#processPendingReconstructions and
> BlockManager#validateReconstructionWork:
> 1. RedundancyMonitor#computeDatanodeWork() generates transfer tasks from
> neededReconstruction via addTaskToDatanode, and puts the tasks into
> pendingReconstruction.
> 2. We set *dfs.namenode.replication.work.multiplier.per.iteration = 200*, and
> the cluster in question has 400 datanodes, so RedundancyMonitor may generate
> up to 80K block transfer tasks per iteration (see the arithmetic sketch
> after this list). After dfs.namenode.reconstruction.pending.timeout-sec
> (5 min), the requests in pendingReconstruction time out, and the
> PendingReconstructionMonitor thread moves the timed-out requests into
> timedOutItems.
> 3. RedundancyMonitor#processPendingReconstructions() puts the requests from
> timedOutItems back into neededReconstruction.
> 4. The monitor then sleeps for one recheck interval:
> TimeUnit.MILLISECONDS.sleep(redundancyRecheckIntervalMs);
> 5. On the next iteration of the while loop,
> RedundancyMonitor#computeDatanodeWork() generates transfer tasks from
> neededReconstruction again, producing repeated tasks, or tasks with a
> different target node (due to the chooseTarget method).
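>
> The arithmetic behind step 2, as a minimal runnable sketch. The figures come
> from the report above; the 3-second recheck interval is Hadoop's default for
> the redundancy recheck and is an assumption here:
> {code:java}
> // Back-of-the-envelope sketch (not Hadoop code) of why transfer tasks time
> // out and get re-scheduled.
> public class ExcessTransferMath {
>   public static void main(String[] args) {
>     int workMultiplier = 200;    // dfs.namenode.replication.work.multiplier.per.iteration
>     int liveDatanodes = 400;
>     int tasksPerIteration = workMultiplier * liveDatanodes;  // 80,000
>     int recheckIntervalSec = 3;  // default redundancy recheck interval (assumed)
>     int pendingTimeoutSec = 300; // dfs.namenode.reconstruction.pending.timeout-sec
>
>     int iterationsBeforeTimeout = pendingTimeoutSec / recheckIntervalSec; // 100
>     long tasksQueuedBeforeFirstTimeout =
>         (long) tasksPerIteration * iterationsBeforeTimeout;  // 8,000,000
>
>     System.out.printf("Up to %,d tasks per iteration%n", tasksPerIteration);
>     System.out.printf("Up to %,d tasks queued before the first timeout%n",
>         tasksQueuedBeforeFirstTimeout);
>     // With ~500 MB blocks (numBytes=524384907 in the log above) queued
>     // behind millions of other transfers, many complete only after the
>     // 5-minute timeout. processPendingReconstructions() then re-adds them
>     // to neededReconstruction, and the next computeDatanodeWork() schedules
>     // a second transfer of a block that is already FINALIZED on the target.
>   }
> }
> {code}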