[
https://issues.apache.org/jira/browse/HDFS-16601?focusedWorklogId=779967&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-779967
]
ASF GitHub Bot logged work on HDFS-16601:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 09/Jun/22 14:01
Start Date: 09/Jun/22 14:01
Worklog Time Spent: 10m
Work Description: Hexiaoqiao commented on PR #4369:
URL: https://github.com/apache/hadoop/pull/4369#issuecomment-1151160510
Thanks for starting this proposal. In my experience there are still many issues with
data transfer during pipeline recovery, covering both basic functionality and
performance. IIRC, there are only a timeout exception and one exception with no
explicit meaning, so the client has no helpful information (such as whether the
source node or the target node hit the problem, or what other exception occurred)
on which to base a decision.
Back to this PR, I totally agree with throwing the exception from the datanode to
the client first (though I am not sure this PR alone is enough; we may need more
information) and then adding more fault-tolerant logic on the client side.
IMO, we should file a new JIRA to design/refactor the fault tolerance of data
transfer during pipeline recovery. Just my own suggestion, not a blocker.
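To make the idea concrete, here is a rough sketch of the kind of information I mean; the class and method names are made up for illustration and are not the existing DataTransferProtocol API:
{code:java}
// Hypothetical sketch only, not Hadoop code: a transfer failure that tells the
// client which node failed, so it can exclude that node instead of guessing.
import java.io.IOException;

public class TransferFailureExample {

  /** Hypothetical exception carrying the failed side of the transfer. */
  static class BlockTransferException extends IOException {
    enum FailedSide { SOURCE, TARGET }
    final FailedSide side;
    final String node;

    BlockTransferException(FailedSide side, String node, Throwable cause) {
      super("transferBlock failed at " + side + " node " + node, cause);
      this.side = side;
      this.node = node;
    }
  }

  /** Hypothetical client-side decision: only exclude the node that actually failed. */
  static String nodeToExclude(BlockTransferException e, String srcNode, String targetNode) {
    return e.side == BlockTransferException.FailedSide.SOURCE ? srcNode : targetNode;
  }

  public static void main(String[] args) {
    BlockTransferException e = new BlockTransferException(
        BlockTransferException.FailedSide.TARGET, "127.0.0.1:59670", null);
    System.out.println("exclude " + nodeToExclude(e, "127.0.0.1:59687", "127.0.0.1:59670"));
  }
}
{code}
With something like that, the client could exclude just the node that failed instead of walking through every datanode in the cluster.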
Issue Time Tracking
-------------------
Worklog Id: (was: 779967)
Time Spent: 1h (was: 50m)
> Failed to replace a bad datanode on the existing pipeline due to no more good
> datanodes being available to try
> --------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-16601
> URL: https://issues.apache.org/jira/browse/HDFS-16601
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1h
> Remaining Estimate: 0h
>
> In our production environment, we hit a bug with the following stack trace:
> {code:java}
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK], DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK]], original=[DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK], DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
>     at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1418)
>     at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1478)
>     at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1704)
>     at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1605)
>     at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
>     at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
>     at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
> {code}
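> For reference, the replacement policy mentioned in the message is controlled by the standard client-side settings; the values below are only illustrative, not our production configuration:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
>
> public class ReplaceDatanodePolicyExample {
>   public static void main(String[] args) {
>     Configuration conf = new Configuration();
>     // Turn datanode replacement on pipeline failure on or off (default: true).
>     conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
>     // Replacement policy: NEVER, DEFAULT or ALWAYS.
>     conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");
>     // If replacement fails, keep writing to the remaining datanodes instead of failing.
>     conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.best-effort", true);
>     System.out.println(conf.get("dfs.client.block.write.replace-datanode-on-failure.policy"));
>   }
> }
> {code}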
> The root cause is that the DFSClient cannot perceive an exception thrown by
> transferBlock during pipeline recovery. If the transfer fails, the DFSClient
> keeps retrying with every datanode in the cluster and then fails.
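> A simplified illustration of that behavior (pseudo-code for this description only, not the actual DataStreamer logic):
> {code:java}
> import java.util.ArrayList;
> import java.util.Arrays;
> import java.util.List;
>
> public class PipelineRecoveryRetryExample {
>
>   // Stand-in for transferBlock: the real failure cause is never reported back,
>   // so from the client's point of view every attempt just "does not work".
>   static boolean transferBlock(String src, String target) {
>     return false;
>   }
>
>   public static void main(String[] args) {
>     List<String> candidates = new ArrayList<>(Arrays.asList("dn1", "dn2", "dn3", "dn4"));
>     String src = "dn0";
>     // Without knowing which node actually failed, the client walks through every
>     // candidate datanode and finally gives up with the IOException shown above.
>     while (!candidates.isEmpty()) {
>       String target = candidates.remove(0);
>       if (transferBlock(src, target)) {
>         System.out.println("pipeline recovered with " + target);
>         return;
>       }
>     }
>     System.out.println("Failed to replace a bad datanode: no more good datanodes to try");
>   }
> }
> {code}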