[ https://issues.apache.org/jira/browse/HDFS-16601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17895083#comment-17895083 ]
ASF GitHub Bot commented on HDFS-16601:
---------------------------------------

ferhui commented on PR #4369:
URL: https://github.com/apache/hadoop/pull/4369#issuecomment-2453266464

   Hi, can we merge it, since it has been approved for a long time? @ZanderXu @haiyang1987


> DataTransfer should throw IOException to Client
> -----------------------------------------------
>
>                 Key: HDFS-16601
>                 URL: https://issues.apache.org/jira/browse/HDFS-16601
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> In our production environment, we found a bug with a stack trace like:
> {code:java}
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK], DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK]], original=[DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK], DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
>         at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1418)
>         at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1478)
>         at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1704)
>         at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1605)
>         at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
>         at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
>         at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
> {code}
> The root cause is that the DFSClient cannot perceive the exception thrown by DataTransfer during pipeline recovery. If the transfer of the block fails, the DFSClient keeps retrying with every datanode in the cluster and ultimately fails.
> While the client is recovering the pipeline, the source DN selected to transfer the block to the new DN may itself be abnormal, so it cannot successfully transfer the block to the new node. Because the failure is not returned to the client, the client assumes the transfer succeeded. Since the block is not actually on the new DN, the client fails to build the pipeline and marks the new DN as bad, then adds it to the exclude list and requests another DN for the next round of pipeline recovery. The new round still chooses the same abnormal DN as the transfer source, so it fails again.
> So the DN should return the transfer failure to the client, so that the client can choose another existing DN as the source to transfer the block to the new DN.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
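[Editor's note] The issue above turns on whether a DataNode's block-transfer failure during pipeline recovery is surfaced to the writing client or silently swallowed. The following is a minimal, self-contained Java sketch of those two behaviours; it is not the actual Hadoop DataNode/DFSClient code. The class and method names (TransferFailureSketch, transferBlockTo) are hypothetical stand-ins for the real DataTransfer path, and the real fix would surface the error through the pipeline-recovery reply rather than a direct method call.

{code:java}
// Illustrative sketch only -- not the Hadoop implementation.
import java.io.IOException;

public class TransferFailureSketch {

    // Pattern the issue describes: the source DN runs the transfer and
    // swallows the failure, so the client never learns that the new DN
    // did not actually receive the block.
    static void transferSwallowingErrors(String sourceDn, String targetDn) {
        try {
            transferBlockTo(sourceDn, targetDn);
        } catch (IOException e) {
            // Only logged locally; from the client's point of view the
            // transfer still looks successful.
            System.err.println("transfer failed: " + e.getMessage());
        }
    }

    // Behaviour the issue proposes: let the IOException propagate so the
    // client sees the transfer failure and can pick another existing DN
    // as the source, instead of repeatedly excluding healthy target DNs.
    static void transferPropagatingErrors(String sourceDn, String targetDn)
            throws IOException {
        transferBlockTo(sourceDn, targetDn);
    }

    // Hypothetical stand-in for the real block-transfer call; here it
    // always fails to simulate an abnormal source DN.
    static void transferBlockTo(String sourceDn, String targetDn) throws IOException {
        throw new IOException("source datanode " + sourceDn + " cannot read the replica");
    }

    public static void main(String[] args) throws Exception {
        transferSwallowingErrors("dn1:9866", "dn3:9866");      // failure hidden
        try {
            transferPropagatingErrors("dn1:9866", "dn3:9866");
        } catch (IOException e) {
            System.out.println("client can now react: " + e.getMessage()); // failure visible
        }
    }
}
{code}

How the client reacts once it does see a pipeline failure is still governed by the 'dfs.client.block.write.replace-datanode-on-failure.policy' setting quoted in the exception message above.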