[
https://issues.apache.org/jira/browse/HDFS-16601?focusedWorklogId=780266&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-780266
]
ASF GitHub Bot logged work on HDFS-16601:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 10/Jun/22 10:01
Start Date: 10/Jun/22 10:01
Worklog Time Spent: 10m
Work Description: Hexiaoqiao commented on PR #4369:
URL: https://github.com/apache/hadoop/pull/4369#issuecomment-1152193661
> Fortunately, at present, as long as the failure exception is thrown to the client, the client assumes by default that the new datanode is abnormal, excludes it, and retries the transfer. During the retry, the client will choose a new source datanode and a new target datanode.
Thanks for the further comment here. Agreed that this improves fault tolerance for the transfer; however, we have to accept that when the source datanode is the one with the issue, the retry may choose that same source again, so the failure cannot be avoided. Is there any way to expose exceptions that distinguish a source-node failure from a target-node failure? If so, it would be helpful for the follow-up fault-tolerance improvement on the client side.
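For illustration only, here is a rough sketch of the exclude-and-retry behavior described above (the class and helper names are hypothetical, not the actual DataStreamer code): because the client cannot tell whether the source or the target side failed, it excludes only the target and may keep re-selecting the same faulty source.
{code:java}
// Hypothetical sketch of the client-side transfer retry, not the real DataStreamer logic.
import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TransferRetrySketch {

  /** Placeholder for the block transfer from source to target; only here to make the sketch compile. */
  static void transferBlock(String source, String target) throws IOException {
    throw new IOException("transfer from " + source + " to " + target + " failed");
  }

  static String pickNode(List<String> nodes, Set<String> excluded) throws IOException {
    for (String n : nodes) {
      if (!excluded.contains(n)) {
        return n;
      }
    }
    throw new IOException("no more good datanodes being available to try");
  }

  /** Try to add a replacement datanode, excluding only the target on each failure. */
  static String addDatanodeWithRetry(List<String> pipeline, List<String> candidates)
      throws IOException {
    Set<String> excludedTargets = new HashSet<>();
    while (true) {
      String source = pipeline.get(0);                 // may be the faulty node every time
      String target = pickNode(candidates, excludedTargets);
      try {
        transferBlock(source, target);                 // the client cannot tell which side failed
        return target;
      } catch (IOException e) {
        excludedTargets.add(target);                   // only the new (target) node is blamed
      }
    }
  }
}
{code}
Under that assumption, a healthy target can be dropped on every attempt while the faulty source keeps being reused, which matches the failure pattern discussed in this issue.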
Issue Time Tracking
-------------------
Worklog Id: (was: 780266)
Time Spent: 1h 20m (was: 1h 10m)
> Failed to replace a bad datanode on the existing pipeline due to no more good
> datanodes being available to try
> --------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-16601
> URL: https://issues.apache.org/jira/browse/HDFS-16601
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> In our production environment, we found a bug with a stack trace like:
> {code:java}
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK], DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK]], original=[DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK], DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
>     at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1418)
>     at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1478)
>     at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1704)
>     at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1605)
>     at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
>     at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
>     at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
> {code}
> And the root cause is that the DFSClient cannot perceive the exception from TransferBlock during PipelineRecovery. If the transfer fails, the DFSClient will retry with every datanode in the cluster and then fail.
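> For context, the replacement policy named in the exception message is a client-side setting. Below is a minimal sketch of how it can be tuned from a client (these are standard HDFS configuration keys; the values shown are only examples, not a fix for this bug):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
>
> public class ReplacePolicyExample {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     // Controls when the client tries to replace a failed datanode: NEVER, DEFAULT or ALWAYS.
>     conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");
>     // With best-effort enabled, the write continues even when no replacement datanode can be found.
>     conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.best-effort", true);
>     FileSystem fs = FileSystem.get(conf);
>     System.out.println("Using " + fs.getUri());
>   }
> }
> {code}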
--
This message was sent by Atlassian Jira
(v8.20.7#820007)