[jira] [Work logged] (HDFS-16601) Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try

ASF GitHub Bot (Jira) Tue, 26 Jul 2022 21:54:24 -0700


     [ 
https://issues.apache.org/jira/browse/HDFS-16601?focusedWorklogId=795517&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-795517
 ]


ASF GitHub Bot logged work on HDFS-16601:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 27/Jul/22 04:53
            Start Date: 27/Jul/22 04:53
    Worklog Time Spent: 10m 
      Work Description: ZanderXu commented on PR #4369:
URL: https://github.com/apache/hadoop/pull/4369#issuecomment-1196264884

   @jojochuang Thanks for your review. We encountered this bug in production 
because the block's checksum file on the source DN was corrupted, which caused 
the transfer to fail. The client then tried all DNs and failed.
   
   So the client should be aware of the transfer status. But it is difficult to 
tell whether an exception was caused by the source node or the target node. 
Maybe we can first throw the transfer failure to the client and let the client 
try the next DN as the source for the block transfer.
   
   cc @Hexiaoqiao 
   
   




Issue Time Tracking
-------------------

    Worklog Id:     (was: 795517)
    Time Spent: 2h 10m  (was: 2h)

> Failed to replace a bad datanode on the existing pipeline due to no more good 
> datanodes being available to try
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16601
>                 URL: https://issues.apache.org/jira/browse/HDFS-16601
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> In our production environment, we hit a bug with the following stack trace:
> {code:java}
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK],
>  
> DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK]],
>  
> original=[DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK],
>  
> DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
>       at 
> org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1418)
>       at 
> org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1478)
>       at 
> org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1704)
>       at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1605)
>       at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
>       at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
>       at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
> {code}
> The root cause is that DFSClient cannot perceive exceptions thrown by 
> TransferBlock during pipeline recovery. If TransferBlock fails, the 
> DFSClient will retry with all datanodes in the cluster and then fail.
> When the client is recovering the pipeline, the source DN selected to 
> transfer the block to the new DN may be abnormal, so it cannot successfully 
> transfer the block to the new node. But the failure is not returned to the 
> client, so the client assumes the transfer succeeded. Because there is no 
> block on the new DN, the client fails to build the pipeline and marks the 
> new DN as bad. The client then adds the new DN to the exclude list and 
> requests another new DN for the next round of pipeline recovery. That round 
> still chooses the abnormal DN as the source DN to transfer the block, so it 
> fails again.
> So I think the DN should return the transfer failure to the client, so that 
> the client can choose another existing DN as the source DN to transfer the 
> block to a new DN.
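
To make the proposal in the description above concrete, here is a minimal, self-contained Java sketch. It is not Hadoop code: `Node`, `transferBlock`, and `addDatanode` are hypothetical stand-ins, and the real `DataStreamer` logic is considerably more involved. It only illustrates the idea that, once a transfer failure is surfaced to the client, the client can fall back to another existing datanode as the source instead of blaming the new target.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class PipelineRecoverySketch {

    /** Hypothetical stand-in for a datanode; not a Hadoop class. */
    static final class Node {
        final String name;
        // false models a source whose replica cannot be read,
        // e.g. because its checksum file is corrupted (assumption).
        final boolean readable;

        Node(String name, boolean readable) {
            this.name = name;
            this.readable = readable;
        }
    }

    /** Models a block transfer that reports failure instead of hiding it. */
    static void transferBlock(Node source, Node target) throws IOException {
        if (!source.readable) {
            throw new IOException(
                "transfer " + source.name + " -> " + target.name + " failed");
        }
    }

    /**
     * Tries each existing pipeline node as the transfer source until one
     * succeeds, and returns the source that worked. Throws only when every
     * existing node has failed as a source.
     */
    static Node addDatanode(List<Node> pipeline, Node target) throws IOException {
        IOException last = null;
        for (Node source : pipeline) {
            try {
                transferBlock(source, target);
                return source;
            } catch (IOException e) {
                last = e; // remember the failure and try the next source
            }
        }
        throw new IOException(
            "no existing datanode could transfer the block to " + target.name, last);
    }

    public static void main(String[] args) throws IOException {
        List<Node> pipeline =
            Arrays.asList(new Node("dn1", false), new Node("dn2", true));
        Node used = addDatanode(pipeline, new Node("dn3", true));
        System.out.println("block transferred from " + used.name);
    }
}
```

In this sketch the bad source `dn1` is skipped and the healthy `dn2` serves the transfer, so the new target is never wrongly marked as bad. How the real client would distinguish source-side from target-side failures is left open, as noted in the PR comment.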



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

