[
https://issues.apache.org/jira/browse/HDFS-16601?focusedWorklogId=779109&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-779109
]
ASF GitHub Bot logged work on HDFS-16601:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 07/Jun/22 14:09
Start Date: 07/Jun/22 14:09
Worklog Time Spent: 10m
Work Description: ZanderXu commented on PR #4369:
URL: https://github.com/apache/hadoop/pull/4369#issuecomment-1148728940
Thanks @Hexiaoqiao.
When the client is recovering the pipeline, the source DN selected to transfer
the block to the new DN may be abnormal, so the source DN cannot transfer the
block to the new node. Because that failure is not returned to the client, the
client believes the transfer completed successfully. Since the new DN does not
actually contain the block, the client then fails to build the pipeline and
marks the new DN as bad. The client adds the new DN to the exclude list and
asks for another DN for the next round of pipeline recovery.
The next recovery round still chooses the same abnormal DN as the source, so
the transfer fails again and the loop repeats.
So the DN should return the failed transfer exception to the client, so that
the client can choose another existing DN as the source to transfer the block
to the new DN.
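A minimal, self-contained simulation of that loop, only to make the failure
mode concrete. All class and method names below are illustrative stand-ins,
not Hadoop's actual DataStreamer/DataNode code:
{code:java}
// Sketch: why excluding the new DN instead of the abnormal source DN
// eventually exhausts every replacement candidate. Names are hypothetical.
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PipelineRecoveryLoopSketch {

  static final class Dn {
    final String name;
    final boolean healthy; // can this node actually transfer the block?
    Dn(String name, boolean healthy) { this.name = name; this.healthy = healthy; }
  }

  public static void main(String[] args) {
    Dn abnormalSource = new Dn("source-dn", false);     // always picked as transfer source
    List<Dn> candidates = List.of(new Dn("dn-A", true), // healthy replacement candidates
                                  new Dn("dn-B", true),
                                  new Dn("dn-C", true));
    Set<String> excluded = new HashSet<>();

    for (Dn newDn : candidates) {
      // The source DN fails to transfer the block, but never reports the
      // failure, so from the client's view the transfer "succeeded".
      boolean blockArrivedOnNewDn = abnormalSource.healthy;

      if (!blockArrivedOnNewDn) {
        // Building the pipeline then fails because newDn has no replica,
        // so the client blames the healthy new DN and excludes it.
        excluded.add(newDn.name);
        System.out.println("Pipeline setup failed; excluded healthy node " + newDn.name);
      } else {
        System.out.println("Recovered via " + newDn.name);
        return;
      }
    }
    // After every candidate has been excluded, the client surfaces the
    // "no more good datanodes being available to try" IOException.
    System.out.println("Failed: no more good datanodes available to try");
  }
}
{code}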
Issue Time Tracking
-------------------
Worklog Id: (was: 779109)
Time Spent: 50m (was: 40m)
> Failed to replace a bad datanode on the existing pipeline due to no more good
> datanodes being available to try
> --------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-16601
> URL: https://issues.apache.org/jira/browse/HDFS-16601
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Major
> Labels: pull-request-available
> Time Spent: 50m
> Remaining Estimate: 0h
>
> In our production environment, we found this bug with a stack trace like:
> {code:java}
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK], DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK]], original=[DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK], DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
>     at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1418)
>     at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1478)
>     at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1704)
>     at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1605)
>     at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
>     at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
>     at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
> {code}
> The root cause is that the DFSClient cannot perceive the exception from
> TransferBlock during PipelineRecovery. If the TransferBlock fails, the
> DFSClient will retry with every datanode in the cluster and then fail.
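> As a hedged illustration only: the exception text above points at the
> client-side replace-datanode-on-failure settings. A client can relax them
> (for example, enable best-effort) so a failed replacement does not abort the
> write, but that only masks the symptom; the actual fix is for the DN to
> report the transfer failure. A sketch, assuming the standard
> org.apache.hadoop.conf.Configuration API:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
>
> public class ReplaceDatanodePolicySketch {
>   public static void main(String[] args) {
>     Configuration conf = new Configuration();
>     // Keep the default replacement policy, but let the write continue with
>     // the remaining datanodes if adding a replacement keeps failing.
>     conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");
>     conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.best-effort", true);
>     // ... build a FileSystem / DFSClient from this conf as usual ...
>   }
> }
> {code}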