[
https://issues.apache.org/jira/browse/HDFS-16601?focusedWorklogId=779997&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-779997
]
ASF GitHub Bot logged work on HDFS-16601:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 09/Jun/22 14:56
Start Date: 09/Jun/22 14:56
Worklog Time Spent: 10m
Work Description: ZanderXu commented on PR #4369:
URL: https://github.com/apache/hadoop/pull/4369#issuecomment-1151228050
Thanks @Hexiaoqiao for your suggestion. Yeah, you are right, we need more
failure information for the client, such as whether the transfer failed on the
source or on the target. If the client had more information about the failed
transfer, it could accurately and efficiently remove abnormal nodes. But that
would be a big feature.
Fortunately, at present, as long as the failure exception is thrown to the
client, the client assumes by default that the new datanode is abnormal,
excludes it, and retries the transfer. During the retry, the client will choose
a new source datanode and a new target datanode, so the source and target
datanodes from the previous failed transfer round will be replaced.
If the target datanode caused the failure, excluding it is enough.
If the source datanode caused the failure, it will be removed when the new
pipeline is built.
So I think the simple approach is to just throw the failure exception to the
client, and let the client find and remove the real abnormal datanode.
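To make that exclude-and-retry flow concrete, here is a minimal, self-contained Java sketch of the loop from the client's point of view. All class and helper names here (TransferRetrySketch, pickNewDatanode, transferBlock, and so on) are hypothetical stand-ins for illustration only, not the real DataStreamer API; the actual logic lives in DataStreamer#addDatanode2ExistingPipeline and its callees.
{code:java}
import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Minimal sketch of the exclude-and-retry behaviour described above.
 * All names are hypothetical placeholders, not the real DataStreamer API.
 */
public class TransferRetrySketch {

  /** Hypothetical stand-in for a datanode in the pipeline. */
  static class Datanode {
    final String id;
    Datanode(String id) { this.id = id; }
    @Override public String toString() { return id; }
  }

  /** Thrown by the (hypothetical) transfer helper when the source or target fails. */
  static class TransferFailedException extends IOException {
    TransferFailedException(String msg) { super(msg); }
  }

  private final Set<Datanode> excluded = new HashSet<>();

  /**
   * Keep retrying the block transfer, excluding the newly added target after
   * each failure, until a transfer succeeds or no candidates remain. This
   * mirrors the comment above: the client treats the new datanode as abnormal,
   * excludes it, and picks a new source and a new target on the next attempt.
   */
  Datanode replaceBadDatanode(List<Datanode> liveNodes, List<Datanode> pipeline)
      throws IOException {
    while (true) {
      Datanode target = pickNewDatanode(liveNodes);   // new target, not yet excluded
      Datanode source = pickSource(pipeline);         // existing pipeline node to copy from
      try {
        transferBlock(source, target);                // may fail on either side
        return target;                                // success: target joins the pipeline
      } catch (TransferFailedException e) {
        // The failure reaches the client; by default it blames the new
        // target, excludes it, and retries with a fresh source and target.
        excluded.add(target);
      }
    }
  }

  private Datanode pickNewDatanode(List<Datanode> liveNodes) throws IOException {
    for (Datanode dn : liveNodes) {
      if (!excluded.contains(dn)) {
        return dn;
      }
    }
    throw new IOException(
        "Failed to replace a bad datanode: no more good datanodes being available to try");
  }

  private Datanode pickSource(List<Datanode> pipeline) {
    // Simplified: take the first remaining pipeline node as the copy source.
    return pipeline.get(0);
  }

  private void transferBlock(Datanode source, Datanode target) throws TransferFailedException {
    // Placeholder for the real DataTransferProtocol transfer; details omitted here.
  }
}
{code}
The point the sketch illustrates is the one made in the comment: the client does not need to distinguish source failures from target failures, because excluding the new target and picking a fresh source on the next attempt eventually replaces whichever datanode was abnormal.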
Issue Time Tracking
-------------------
Worklog Id: (was: 779997)
Time Spent: 1h 10m (was: 1h)
> Failed to replace a bad datanode on the existing pipeline due to no more good
> datanodes being available to try
> --------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-16601
> URL: https://issues.apache.org/jira/browse/HDFS-16601
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> In our production environment, we found a bug, with a stack trace like:
> {code:java}
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes:
> current=[DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK],
> DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK]],
> original=[DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK],
> DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK]]).
> The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
>     at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1418)
>     at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1478)
>     at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1704)
>     at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1605)
>     at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
>     at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
>     at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
> {code}
> And the root cause is that the DFSClient cannot perceive the exception from
> TransferBlock during PipelineRecovery. If the transfer fails, the DFSClient
> will retry against all datanodes in the cluster and then fail.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)