[
https://issues.apache.org/jira/browse/HDFS-16989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
farmmamba updated HDFS-16989:
-----------------------------
Description:
Recently, we changed the replication factor of a directory holding 1.6 PB of
data from 2 to 3. There are 76 million blocks in this directory. After we
executed the setrep command, the active namenode printed lots of logs like below:
{code:java}
PendingReconstructionMonitor timed out blk_xxxx_260285131{code}
and many datanodes printed lots of duplicate logs like below:
{code:java}
2023-04-21 13:58:17,627 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(1.1.1.1:50010,
datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075,
infoSecurePort=0, ipcPort=8010,
storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting
thread to transfer BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131
to 2.2.2.2:50010
2023-04-21 14:21:21,296 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
DataTransfer, at 1.1.1.1:50010: Transmitted
BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131
(numBytes=524384907) to /2.2.2.2:50010
14:34:19,134 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(1.1.1.1:50010,
datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075,
infoSecurePort=0, ipcPort=8010,
storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting
thread to transfer BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131
to 2.2.2.2:50010
2023-04-21 14:37:58,207 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(1.1.1.1:50010,
datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075,
infoSecurePort=0, ipcPort=8010,
storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting
thread to transfer BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131
to 2.2.2.2:50010
14:40:46,817 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(1.1.1.1:50010,
datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075,
infoSecurePort=0, ipcPort=8010,
storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting
thread to transfer BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131
to 2.2.2.2:50010
{code}
Strangely, the same block blk_1333463885_260285131 was transferred multiple
times even though it had already been transmitted successfully. The excess
transfer requests trigger lots of ReplicaAlreadyExistsException errors on the
target datanode, because the replica has already been transmitted and its
state is FINALIZED.
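For illustration, the rejection on the target datanode can be modeled with a
minimal sketch. This is hypothetical, simplified code, not the actual
FsDatasetImpl implementation; only the exception name mirrors the real one:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of a datanode-side replica map that rejects a second
// transfer of a block whose replica is already FINALIZED.
public class ReplicaMapSketch {
    enum ReplicaState { TEMPORARY, FINALIZED }

    static class ReplicaAlreadyExistsException extends RuntimeException {
        ReplicaAlreadyExistsException(String msg) { super(msg); }
    }

    private final Map<Long, ReplicaState> replicas = new HashMap<>();

    // Called when an incoming transfer tries to create a replica for blockId.
    void createTemporary(long blockId) {
        ReplicaState existing = replicas.get(blockId);
        if (existing == ReplicaState.FINALIZED) {
            throw new ReplicaAlreadyExistsException(
                "Block " + blockId + " already exists in state " + existing);
        }
        replicas.put(blockId, ReplicaState.TEMPORARY);
    }

    void finalizeReplica(long blockId) {
        replicas.put(blockId, ReplicaState.FINALIZED);
    }

    public static void main(String[] args) {
        ReplicaMapSketch map = new ReplicaMapSketch();
        map.createTemporary(1333463885L);  // first transfer succeeds
        map.finalizeReplica(1333463885L);  // replica is now FINALIZED
        try {
            map.createTemporary(1333463885L); // duplicate transfer is rejected
        } catch (ReplicaAlreadyExistsException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In the real cluster, the second and later transfers hit such a check because
the first transfer has already finalized the replica on 2.2.2.2.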
The root cause lies in RedundancyMonitor#processPendingReconstructions and
BlockManager#validateReconstructionWork:
1. RedundancyMonitor#computeDatanodeWork() generates transfer tasks from
neededReconstruction via addTaskToDatanode, and puts the tasks into
pendingReconstruction.
2. We set *dfs.namenode.replication.work.multiplier.per.iteration = 200* and
this cluster has 400 datanodes, so RedundancyMonitor may generate up to 80K
(200 x 400) block transfer tasks per iteration. After
dfs.namenode.reconstruction.pending.timeout-sec (5 min), the requests in
pendingReconstruction time out, and the PendingReconstructionMonitor thread
moves the timed-out requests into timedOutItems.
3. RedundancyMonitor#processPendingReconstructions() puts the requests in
timedOutItems back into neededReconstruction.
4. The RedundancyMonitor thread sleeps:
TimeUnit.MILLISECONDS.sleep(redundancyRecheckIntervalMs).
5. In the next iteration of the while loop,
RedundancyMonitor#computeDatanodeWork() generates transfer tasks from
neededReconstruction again. This can produce a repeated task, or a task with a
different target node (due to the chooseTarget method).
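The steps above can be condensed into a toy simulation (hypothetical class and
variable names, not the real NameNode code): a pending transfer that is not
confirmed within the timeout falls back into neededReconstruction and is
scheduled again on the next iteration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Toy simulation of the scheduling loop: a transfer that is never confirmed
// within the pending timeout is moved back to neededReconstruction and
// re-scheduled, producing duplicate transfer tasks for the same block.
public class PendingTimeoutSketch {
    // Mirrors dfs.namenode.reconstruction.pending.timeout-sec = 300 (5 min).
    static final int PENDING_TIMEOUT_SEC = 300;

    static List<String> simulate() {
        Queue<Long> neededReconstruction = new ArrayDeque<>();
        Map<Long, Integer> pendingSince = new HashMap<>(); // blockId -> schedule time
        List<String> tasks = new ArrayList<>();

        neededReconstruction.add(1333463885L);
        for (int now = 0; now <= 900; now += 100) {  // logical clock, seconds
            final int t = now;
            // PendingReconstructionMonitor: move timed-out items back to needed.
            pendingSince.entrySet().removeIf(e -> {
                if (t - e.getValue() > PENDING_TIMEOUT_SEC) {
                    neededReconstruction.add(e.getKey());
                    return true;
                }
                return false;
            });
            // computeDatanodeWork: turn every needed block into a transfer task.
            while (!neededReconstruction.isEmpty()) {
                long blk = neededReconstruction.poll();
                pendingSince.put(blk, t);
                tasks.add("t=" + t + "s transfer blk_" + blk);
            }
        }
        return tasks;
    }

    public static void main(String[] args) {
        simulate().forEach(System.out::println);
    }
}
```

With a 300-second timeout and no confirmation arriving before it fires, the
simulation schedules the same block three times in 900 simulated seconds,
matching the repeated "Starting thread to transfer" lines above.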
was:
Recently, we changed the replication factor of a directory holding 1.6 PB of
data from 2 to 3. There are 76 million blocks in this directory. After we
executed the setrep command, the active namenode printed lots of logs like below:
{code:java}
PendingReconstructionMonitor timed out blk_xxxx_260285131{code}
and many datanodes printed lots of duplicate logs like below:
{code:java}
2023-04-21 13:58:17,627 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(1.1.1.1:50010,
datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075,
infoSecurePort=0, ipcPort=8010,
storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting
thread to transfer BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131
to 2.2.2.2:50010
2023-04-21 14:21:21,296 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
DataTransfer, at 1.1.1.1:50010: Transmitted
BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131
(numBytes=524384907) to /2.2.2.2:50010
14:34:19,134 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(1.1.1.1:50010,
datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075,
infoSecurePort=0, ipcPort=8010,
storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting
thread to transfer BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131
to 2.2.2.2:50010
2023-04-21 14:37:58,207 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(1.1.1.1:50010,
datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075,
infoSecurePort=0, ipcPort=8010,
storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting
thread to transfer BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131
to 2.2.2.2:50010
14:40:46,817 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(1.1.1.1:50010,
datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075,
infoSecurePort=0, ipcPort=8010,
storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting
thread to transfer BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131
to 2.2.2.2:50010
{code}
Strangely, the same block blk_1333463885_260285131 was transferred multiple
times even though it had already been transmitted successfully. The excess
transfer requests trigger lots of ReplicaAlreadyExistsException errors on the
target datanode, because the replica has already been transmitted and its
state is FINALIZED.
The root cause lies in RedundancyMonitor#processPendingReconstructions and
BlockManager#validateReconstructionWork.
It can be reproduced by ...
> Large scale block transfer causes too many excess blocks
> --------------------------------------------------------
>
> Key: HDFS-16989
> URL: https://issues.apache.org/jira/browse/HDFS-16989
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.4.0, 3.3.5
> Reporter: farmmamba
> Priority: Critical
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]