Junegunn Choi created HDFS-17690:
------------------------------------
Summary: Avoid redundant EC reconstruction tasks after pending
reconstruction timeouts
Key: HDFS-17690
URL: https://issues.apache.org/jira/browse/HDFS-17690
Project: Hadoop HDFS
Issue Type: Improvement
Components: ec
Affects Versions: 3.4.1
Reporter: Junegunn Choi
Attachments: image-2024-12-18-13-50-11-978.png,
image-2024-12-18-13-51-33-552.png
h2. Problem
We are running HDFS clusters with RS-6-3-1024k policy enabled.
When datanodes go down, HDFS reconstructs the under-replicated EC blocks as
expected and the number of low redundancy blocks decreases. However, after some
time, we observed a lot of wasted effort reconstructing EC blocks that had
already been fully reconstructed long before.
!image-2024-12-18-13-50-11-978.png!
* We shut down two datanodes in an HDFS cluster with 21 datanodes.
* Each datanode holds around 2 TB of data on SSD devices.
* It took 10 hours for the low redundancy metric to reach 0.
* Lots of (17K+) failed reconstruction tasks, wasting disk and network
resources during the period.
** (There were no other activities on the cluster)
These are the log messages for {{{}blk_-9223372036851532480_352520{}}}; note
the repeated {{"Failed to reconstruct"}} messages from different datanodes.
{noformat}
@timestamp,server,loglevel,role,data
Dec 18, 2024 @ 00:56:07.000,s19,WARN,datanode,Failed to reconstruct striped
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 01:03:41.000,s04,WARN,datanode,Failed to reconstruct striped
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 04:56:16.000,s20,WARN,datanode,Failed to reconstruct striped
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 05:18:37.000,s20,WARN,datanode,Failed to reconstruct striped
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 07:18:31.000,s20,WARN,datanode,Failed to reconstruct striped
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 07:38:34.000,s20,WARN,datanode,Failed to reconstruct striped
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 08:18:28.000,s19,WARN,datanode,Failed to reconstruct striped
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 08:18:40.000,s19,WARN,datanode,Failed to reconstruct striped
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 08:19:07.000,s19,WARN,datanode,Failed to reconstruct striped
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 08:23:08.000,s18,WARN,datanode,Failed to reconstruct striped
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
Dec 18, 2024 @ 08:23:38.000,s18,WARN,datanode,Failed to reconstruct striped
block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
{noformat}
The repeated failures were caused by {{{}ReplicaAlreadyExistsException{}}},
which means the tasks being performed were redundant.
{noformat}
24/12/18 08:23:59 INFO datanode.DataNode: DatanodeCommand action:
DNA_ERASURE_CODING_RECOVERY
24/12/18 08:23:59 INFO datanode.DataNode: Receiving
BP-1483963022-k1-1732859582319:blk_-9223372036851532476_352520 src: /s18:35816
dest: /s18:1004
24/12/18 08:23:59 INFO datanode.DataNode: opWriteBlock
BP-1483963022-k1-1732859582319:blk_-9223372036851532476_352520 received
exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException:
Block BP-1483963022-k1-1732859582319:blk_-9223372036851532476_352520 already
exists in state FINALIZED and thus cannot be created.
24/12/18 08:23:59 INFO datanode.DataNode: s18:1004:DataXceiver error processing
WRITE_BLOCK operation src: /s18:35816 dst: /s18:1004;
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block
BP-1483963022-k1-1732859582319:blk_-9223372036851532476_352520 already exists
in state FINALIZED and thus cannot be created.
24/12/18 08:23:59 WARN datanode.DataNode: Broken pipe
24/12/18 08:23:59 WARN datanode.DataNode: Failed to reconstruct striped block:
BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
java.io.IOException: Transfer failed for all targets.
at
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:118)
at
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:63)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}
{{hdfs fsck -blockId blk_-9223372036851532480}} also shows that this block
doesn't need reconstruction.
{noformat}
No. of Expected Replica: 9
No. of live Replica: 9
{noformat}
h2. Workaround
Because this happens only after several {{PendingReconstructionMonitor timed
out}} messages, I tried increasing
{{dfs.namenode.reconstruction.pending.timeout-sec}} so that no timeouts occur,
and the problem disappeared.
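For reference, the timeout can be raised in {{hdfs-site.xml}}. The default is
300 seconds; the value below is only an example sized for our cluster, not a
general recommendation:

```xml
<!-- hdfs-site.xml: raise the pending reconstruction timeout so that
     in-flight EC reconstructions are not prematurely declared timed
     out and rescheduled. 1800 is an example value, not a tuned one. -->
<property>
  <name>dfs.namenode.reconstruction.pending.timeout-sec</name>
  <value>1800</value>
</property>
```

Note this only masks the problem: a timeout that is too long also delays
rescheduling of genuinely failed reconstruction tasks.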
h2. Cause
When an EC reconstruction task for a block is scheduled, a
{{BlockECReconstructionInfo}} for the block is added to the
{{erasurecodeBlocks}} queue of the target node's {{DatanodeDescriptor}}.
However, it is not removed from that queue when the reconstruction task times
out and is recreated and rescheduled. As a result, we end up with multiple
redundant {{BlockECReconstructionInfo}} objects on the queues of several
datanodes. This explains the large amount of wasted effort after multiple
{{PendingReconstructionMonitor}} timeouts.
You're more likely to experience this problem on a relatively small cluster
with a small {{dfs.namenode.replication.max-streams}} value, where block
reconstruction for failed datanodes takes a considerable amount of time and
triggers many pending reconstruction timeouts. This was exactly our case:
our cluster has only 21 datanodes and {{dfs.namenode.replication.max-streams}}
is set to 1, because we run HBase on the same cluster and don't want
eager reconstruction to affect HBase's performance.
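The accumulation can be illustrated with a small standalone simulation. The
types below ({{ReconstructionInfo}}, {{Datanode}}) are simplified stand-ins
for {{BlockECReconstructionInfo}} and {{DatanodeDescriptor}}, not the actual
NameNode code:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Standalone simulation of the bug: each pending-reconstruction timeout
// re-creates the task and enqueues it on a (possibly different) target
// datanode, while the stale entry already sitting on the previous
// target's queue is never removed.
public class RedundantEcTaskSimulation {
    // Stand-in for BlockECReconstructionInfo: just the block id here.
    public record ReconstructionInfo(long blockId) {}

    // Stand-in for DatanodeDescriptor and its erasurecodeBlocks queue.
    public static class Datanode {
        final String name;
        final Queue<ReconstructionInfo> erasurecodeBlocks = new ArrayDeque<>();
        public Datanode(String name) { this.name = name; }
    }

    /**
     * Simulate an initial schedule plus (rounds - 1) timeout-driven
     * reschedules for one block. Each round enqueues a fresh task and
     * leaves any old entries behind. Returns total tasks queued
     * cluster-wide for that single block.
     */
    public static int queuedTaskCount(int rounds, List<Datanode> nodes,
                                      long blockId) {
        for (int round = 0; round < rounds; round++) {
            nodes.get(round % nodes.size()).erasurecodeBlocks
                 .add(new ReconstructionInfo(blockId));
        }
        return nodes.stream().mapToInt(n -> n.erasurecodeBlocks.size()).sum();
    }

    public static void main(String[] args) {
        List<Datanode> nodes = List.of(
            new Datanode("dn0"), new Datanode("dn1"), new Datanode("dn2"));
        // One block, two timeouts: three tasks end up queued, and two of
        // them will later fail with ReplicaAlreadyExistsException once
        // the first reconstruction finalizes the replica.
        System.out.println("tasks queued for one block: "
            + queuedTaskCount(3, nodes, -9223372036851532480L));
    }
}
```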
h2. Suggested fix
Avoid processing redundant {{BlockECReconstructionInfo}} entries from the
queue: check again whether the task is really needed just before dispatching
it to the datanode.
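The idea can be sketched as follows. This is a hypothetical, simplified
illustration, not the actual Hadoop patch: {{ReconstructionInfo}} stands in
for {{BlockECReconstructionInfo}}, and the {{stillNeedsReconstruction}}
predicate stands in for the NameNode's check of live vs. expected replicas:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.LongPredicate;

// Hypothetical sketch of the suggested fix: while draining a datanode's
// EC reconstruction queue, re-check each task against the current block
// state and drop tasks whose block no longer needs reconstruction,
// instead of dispatching them to the datanode.
public class DispatchFilterSketch {
    public record ReconstructionInfo(long blockId) {}

    /** Drain the queue, dispatching only tasks that are still needed. */
    public static int dispatch(Queue<ReconstructionInfo> queue,
                               LongPredicate stillNeedsReconstruction) {
        int dispatched = 0;
        ReconstructionInfo task;
        while ((task = queue.poll()) != null) {
            if (!stillNeedsReconstruction.test(task.blockId())) {
                // Redundant entry left over from a timeout/reschedule:
                // skip it instead of sending it to the datanode.
                System.out.println("Block " + task.blockId()
                    + " doesn't need reconstruction, skipping");
                continue;
            }
            // In the NameNode this is where the task would become a
            // DNA_ERASURE_CODING_RECOVERY command for the datanode.
            dispatched++;
        }
        return dispatched;
    }

    public static void main(String[] args) {
        Queue<ReconstructionInfo> q = new ArrayDeque<>();
        q.add(new ReconstructionInfo(1L)); // still under-replicated
        q.add(new ReconstructionInfo(2L)); // already fully reconstructed
        q.add(new ReconstructionInfo(2L)); // duplicate from a timeout
        // Pretend only block 1 still has fewer live replicas than expected.
        System.out.println("dispatched=" + dispatch(q, id -> id == 1L));
    }
}
```

With this check in place, stale duplicates left on the queues become harmless
no-ops on the NameNode side instead of failed write attempts on the datanodes.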
h2. Result
!image-2024-12-18-13-51-33-552.png!
* Took down 2 datanodes
* Only took 3.5 hours to complete the reconstruction
* 10K+ {{"doesn't need reconstruction"}} messages in the namenode log, showing
that we have successfully avoided dispatching redundant tasks.
* No {{org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException}}
errors
h2. Alternative approach considered
We could instead remove the {{BlockECReconstructionInfo}} from the queue when
the task times out, but this would require more bookkeeping and be more
error-prone.