[
https://issues.apache.org/jira/browse/HDFS-16601?focusedWorklogId=779997&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-779997
]
ASF GitHub Bot logged work on HDFS-16601:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 09/Jun/22 14:56
Start Date: 09/Jun/22 14:56
Worklog Time Spent: 10m
Work Description: ZanderXu commented on PR #4369:
URL: https://github.com/apache/hadoop/pull/4369#issuecomment-1151228050
Thanks @Hexiaoqiao for your suggestion. Yeah, you are right, we need more
failure information for the client, such as whether the transfer failed on the
source or on the target. If the client had more information about the failed
transfer, it could accurately and efficiently remove abnormal nodes. But that
would be a big feature.
Fortunately, at present, as long as the failure exception is thrown to the
client, the client assumes by default that the new datanode is abnormal,
excludes it, and retries the transfer. During the retry, the client will choose
a new source datanode and a new target datanode, so the source and target
datanodes from the previous failed transfer round will be replaced.
If the target datanode caused the failure, excluding it is enough.
If the source datanode caused the failure, it will be removed when the new
pipeline is built.
So I think the simple approach is to just throw the failure exception to the
client, and let the client find and remove the real abnormal datanode.
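To make that exclude-and-retry flow concrete, here is a minimal, self-contained Java sketch of the loop from the client's point of view. All class and helper names here (TransferRetrySketch, pickNewDatanode, transferBlock, and so on) are hypothetical stand-ins for illustration only, not the real DataStreamer API; the actual logic lives in DataStreamer#addDatanode2ExistingPipeline and its callees.
{code:java}
import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Minimal sketch of the exclude-and-retry behaviour described above.
 * All names are hypothetical placeholders, not the real DataStreamer API.
 */
public class TransferRetrySketch {

  /** Hypothetical stand-in for a datanode in the pipeline. */
  static class Datanode {
    final String id;
    Datanode(String id) { this.id = id; }
    @Override public String toString() { return id; }
  }

  /** Thrown by the (hypothetical) transfer helper when the source or target fails. */
  static class TransferFailedException extends IOException {
    TransferFailedException(String msg) { super(msg); }
  }

  private final Set<Datanode> excluded = new HashSet<>();

  /**
   * Keep retrying the block transfer, excluding the newly added target after
   * each failure, until a transfer succeeds or no candidates remain. This
   * mirrors the comment above: the client treats the new datanode as abnormal,
   * excludes it, and picks a new source and a new target on the next attempt.
   */
  Datanode replaceBadDatanode(List<Datanode> liveNodes, List<Datanode> pipeline)
      throws IOException {
    while (true) {
      Datanode target = pickNewDatanode(liveNodes);   // new target, not yet excluded
      Datanode source = pickSource(pipeline);         // existing pipeline node to copy from
      try {
        transferBlock(source, target);                // may fail on either side
        return target;                                // success: target joins the pipeline
      } catch (TransferFailedException e) {
        // The failure reaches the client; by default it blames the new
        // target, excludes it, and retries with a fresh source and target.
        excluded.add(target);
      }
    }
  }

  private Datanode pickNewDatanode(List<Datanode> liveNodes) throws IOException {
    for (Datanode dn : liveNodes) {
      if (!excluded.contains(dn)) {
        return dn;
      }
    }
    throw new IOException(
        "Failed to replace a bad datanode: no more good datanodes being available to try");
  }

  private Datanode pickSource(List<Datanode> pipeline) {
    // Simplified: take the first remaining pipeline node as the copy source.
    return pipeline.get(0);
  }

  private void transferBlock(Datanode source, Datanode target) throws TransferFailedException {
    // Placeholder for the real DataTransferProtocol transfer; details omitted here.
  }
}
{code}
The point the sketch illustrates is the one made in the comment: the client does not need to distinguish source failures from target failures, because excluding the new target and picking a fresh source on the next attempt eventually replaces whichever datanode was abnormal.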
Issue Time Tracking
-------------------
Worklog Id: (was: 779997)
Time Spent: 1h 10m (was: 1h)
> Failed to replace a bad datanode on the existing pipeline due to no more good
> datanodes being available to try
> --------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-16601
> URL: https://issues.apache.org/jira/browse/HDFS-16601
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> In our production environment, we found a bug, with a stack trace like:
> {code:java}
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes:
> current=[DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK],
> DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK]],
> original=[DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK],
> DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK]]).
> The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
>     at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1418)
>     at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1478)
>     at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1704)
>     at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1605)
>     at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
>     at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
>     at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
> {code}
> And the root cause is that the DFSClient cannot perceive the exception from
> TransferBlock during PipelineRecovery. If the transfer fails, the DFSClient
> will retry against all datanodes in the cluster and then fail.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)