[ 
https://issues.apache.org/jira/browse/HDFS-16601?focusedWorklogId=780272&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-780272
 ]

ASF GitHub Bot logged work on HDFS-16601:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 10/Jun/22 10:09
            Start Date: 10/Jun/22 10:09
    Worklog Time Spent: 10m 
      Work Description: ZanderXu commented on PR #4369:
URL: https://github.com/apache/hadoop/pull/4369#issuecomment-1152200912

   > the source datanode meets issue and choose the same one when retry
   
   It will choose the next datanode as the source datanode on retry.
   
   The code looks like the snippet below; `tried` is incremented on each retry.
   ```
         final DatanodeInfo src = original[tried % original.length];
         final DatanodeInfo[] targets = {nodes[d]};
         final StorageType[] targetStorageTypes = {storageTypes[d]};
   
         try {
           transfer(src, targets, targetStorageTypes, lb.getBlockToken());
         } catch (IOException ioe) {
           DFSClient.LOG.warn("Error transferring data from " + src + " to " +
               nodes[d] + ": " + ioe.getMessage());
           caughtException = ioe;
           // add the allocated node to the exclude list.
           exclude.add(nodes[d]);
           setPipeline(original, originalTypes, originalIDs);
           tried++;
           continue;
         }
   ```
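   
   To make the rotation easier to see in isolation, here is a minimal, self-contained sketch of the same indexing. It is not the Hadoop code: `SourceRotationSketch`, `Transfer`, `pickSource`, and `attempt` are hypothetical names, datanodes are plain strings, and the candidate targets come from a fixed array rather than from the NameNode.
   
   ```java
   import java.io.IOException;
   import java.util.ArrayList;
   import java.util.List;

   // Hypothetical stand-in for the retry loop; none of these names are Hadoop APIs.
   class SourceRotationSketch {

     /** Hypothetical transfer hook so that a failure can be simulated. */
     interface Transfer {
       void run(String src, String target) throws IOException;
     }

     /** Same indexing as the snippet above: rotate through the original pipeline. */
     static String pickSource(String[] original, int tried) {
       return original[tried % original.length];
     }

     /**
      * Tries each candidate target in turn. A failed transfer excludes the
      * target and increments tried, so the next attempt uses the next source.
      * Returns the sources used, in order of attempts.
      */
     static List<String> attempt(String[] original, String[] candidates,
         Transfer transfer, List<String> exclude) {
       List<String> sourcesUsed = new ArrayList<>();
       int tried = 0;
       for (String target : candidates) {
         String src = pickSource(original, tried);
         sourcesUsed.add(src);
         try {
           transfer.run(src, target);
           return sourcesUsed; // transfer succeeded; stop retrying
         } catch (IOException ioe) {
           exclude.add(target); // the target is excluded, not the source
           tried++;
         }
       }
       return sourcesUsed; // every candidate failed
     }

     public static void main(String[] args) {
       String[] original = {"dn1", "dn2"};
       String[] candidates = {"dn3", "dn4", "dn5"};
       List<String> exclude = new ArrayList<>();
       // Simulate a source (dn1) that always fails to transfer.
       List<String> used = attempt(original, candidates, (src, target) -> {
         if (src.equals("dn1")) {
           throw new IOException("simulated failure on " + src);
         }
       }, exclude);
       // prints "sources tried: [dn1, dn2], excluded: [dn3]"
       System.out.println("sources tried: " + used + ", excluded: " + exclude);
     }
   }
   ```
   
   In this sketch the second attempt moves from `dn1` to `dn2` because `tried` went from 0 to 1. The failed target `dn3` still lands in the exclude list even though the simulated fault was on the source side.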




Issue Time Tracking
-------------------

    Worklog Id:     (was: 780272)
    Time Spent: 1.5h  (was: 1h 20m)

> Failed to replace a bad datanode on the existing pipeline due to no more good 
> datanodes being available to try
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16601
>                 URL: https://issues.apache.org/jira/browse/HDFS-16601
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In our production environment, we hit a bug with the following stack trace:
> {code:java}
> java.io.IOException: Failed to replace a bad datanode on the existing 
> pipeline due to no more good datanodes being available to try. (Nodes: 
> current=[DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK],
>  
> DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK]],
>  
> original=[DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK],
>  
> DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK]]).
>  The current failed datanode replacement policy is DEFAULT, and a client may 
> configure this via 
> 'dfs.client.block.write.replace-datanode-on-failure.policy' in its 
> configuration.
>       at 
> org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1418)
>       at 
> org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1478)
>       at 
> org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1704)
>       at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1605)
>       at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
>       at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
>       at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
> {code}
> The root cause is that DFSClient cannot perceive an exception thrown by 
> TransferBlock during PipelineRecovery. If TransferBlock fails, the 
> DFSClient retries every datanode in the cluster and then fails.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

