[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

ASF GitHub Bot (Jira) Wed, 09 Aug 2023 20:19:04 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752600#comment-17752600
 ]


ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

zhangshuyan0 opened a new pull request, #5937:
URL: https://github.com/apache/hadoop/pull/5937

   EC: Fix the bug of failed lease recovery.
   
   If the client crashes before writing the minimum number of internal blocks 
required by the EC policy, lease recovery for the corresponding unclosed file 
may fail repeatedly. Taking the RS(6,3) policy as an example, the timeline is 
as follows:
   1. The client writes some data to only 5 datanodes.
   2. The client crashes.
   3. The NN fails over.
   4. The result of `uc.getNumExpectedLocations()` now depends entirely on 
block reports, and only 5 datanodes report internal blocks.
   5. When the lease's hard limit expires, the NN issues a block recovery 
command.
   6. The datanode checks the command, finds that the number of internal 
blocks is insufficient, and throws an exception, so the recovery fails:
   
https://github.com/apache/hadoop/blob/b6edcb9a84ceac340c79cd692637b3e11c997cc5/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockRecoveryWorker.java#L534-L540
   7. The lease's hard limit expires again and the NN issues another block 
recovery command, which fails in the same way, and the cycle repeats.
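
   The retry loop in steps 5 to 7 can be illustrated with a minimal, 
self-contained sketch. It is not the code in `BlockRecoveryWorker`; the class 
name, method names, and exception message below are hypothetical and only 
model the behaviour described above, assuming the datanode rejects recovery 
whenever fewer internal blocks are reported than the policy has data units.

```java
import java.io.IOException;

/** Hypothetical model of the datanode-side check; not the actual Hadoop code. */
final class StripedRecoveryCheckSketch {

  /** Rejects recovery when fewer internal blocks exist than data units. */
  static void checkLocations(int reportedInternalBlocks, int numDataUnits)
      throws IOException {
    if (reportedInternalBlocks < numDataUnits) {
      throw new IOException("Not enough internal blocks (current: "
          + reportedInternalBlocks + ", required: " + numDataUnits
          + "), unable to start recovery");
    }
  }

  /**
   * Simulates the NN re-issuing the recovery command each time the lease's
   * hard limit expires. Returns how many attempts failed before one passed
   * the check, or the total number of attempts if none did.
   */
  static int failedAttempts(int reportedInternalBlocks, int numDataUnits,
      int hardLimitExpirations) {
    int failures = 0;
    for (int i = 0; i < hardLimitExpirations; i++) {
      try {
        checkLocations(reportedInternalBlocks, numDataUnits);
        return failures; // the check passed, so recovery can start
      } catch (IOException e) {
        failures++; // nothing changes between attempts, so it fails again
      }
    }
    return failures;
  }
}
```

   With RS(6,3) and 5 reported internal blocks, `failedAttempts(5, 6, n)` 
equals `n` for any `n`, because the number of reported blocks never changes 
between attempts.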
   
   When the client has written fewer than 6 internal blocks (the number of 
data units in RS(6,3)), the block group is actually unrecoverable. We should 
treat this situation the same as the case where the number of replicas is 0 
for a replicated file, i.e., directly remove the last block group and close 
the file.
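
   The proposed rule can be sketched as a single decision function. This is a 
simplified illustration, not the patch in the pull request: the class, enum, 
and method names are hypothetical, and the real NameNode logic may take other 
conditions into account. It assumes only what is stated above, namely that a 
replicated block with 0 replicas, or a striped block group with fewer 
internal blocks than data units, cannot be recovered.

```java
/** Hypothetical model of the proposed NN-side decision; not the actual patch. */
final class LeaseRecoveryDecisionSketch {

  enum Action {
    REMOVE_LAST_BLOCK_AND_CLOSE,
    ISSUE_BLOCK_RECOVERY
  }

  /**
   * @param striped           true for an EC (striped) file, false for a replicated file
   * @param numDataUnits      data units of the EC policy (6 for RS(6,3)); ignored if not striped
   * @param expectedLocations number of reported replicas or internal blocks
   */
  static Action decide(boolean striped, int numDataUnits, int expectedLocations) {
    // A replicated block needs at least one replica; a striped block group
    // needs at least as many internal blocks as the policy has data units.
    int minRecoverable = striped ? numDataUnits : 1;
    return expectedLocations < minRecoverable
        ? Action.REMOVE_LAST_BLOCK_AND_CLOSE
        : Action.ISSUE_BLOCK_RECOVERY;
  }
}
```

   For the timeline above, `decide(true, 6, 5)` yields 
`REMOVE_LAST_BLOCK_AND_CLOSE`, so the file is closed instead of being retried 
on every hard-limit expiry.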
   
   




> EC: Fix the bug of failed lease recovery.
> -----------------------------------------
>
>                 Key: HDFS-17150
>                 URL: https://issues.apache.org/jira/browse/HDFS-17150
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Shuyan Zhang
>            Priority: Major
>
> If the client crashes before writing the minimum number of internal blocks 
> required by the EC policy, lease recovery for the corresponding unclosed file 
> may fail repeatedly. Taking the RS(6,3) policy as an example, the timeline is 
> as follows:
> 1. The client writes some data to only 5 datanodes.
> 2. The client crashes.
> 3. The NN fails over.
> 4. The result of `uc.getNumExpectedLocations()` now depends entirely on 
> block reports, and only 5 datanodes report internal blocks.
> 5. When the lease's hard limit expires, the NN issues a block recovery 
> command.
> 6. The datanode checks the command, finds that the number of internal 
> blocks is insufficient, and raises an error, so the recovery fails.
> 7. The lease's hard limit expires again and the NN issues another block 
> recovery command, which fails in the same way, and the cycle repeats.
>
> When the client has written fewer than 6 internal blocks (the number of 
> data units in RS(6,3)), the block group is actually unrecoverable. We should 
> treat this situation the same as the case where the number of replicas is 0 
> for a replicated file, i.e., directly remove the last block group and close 
> the file.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

