[ 
https://issues.apache.org/jira/browse/HDFS-14800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919344#comment-16919344
 ] 

Jinglun edited comment on HDFS-14800 at 8/30/19 9:06 AM:
---------------------------------------------------------

Hi [~jojochuang], I think the problem is caused by adding the recovery attemp 
before initializing the block recovery.
 When the first recoverLease call arrives it will be added to 
pendingRecoveryBlocks by addBlockRecoveryAttempt().  If the NameNode has not 
got any block report for the last block, the initializeBLockRecovery() will 
fail.  So the block recovery doesn't start.
 Then the later recoverLease calls arrive and are all rejected by 
addBlockRecoveryAttempt() until the first is timeout.
 The failing unit test only retry 10 times so it failed.
{code:java}
if (blockManager.addBlockRecoveryAttempt(lastBlock)) {
  long blockRecoveryId = nextGenerationStamp(
  blockManager.isLegacyBlock(lastBlock));
  if(copyOnTruncate) {
    lastBlock.setGenerationStamp(blockRecoveryId);
  } else if(truncateRecovery) {
    recoveryBlock.setGenerationStamp(blockRecoveryId);
  }
  uc.initializeBlockRecovery(lastBlock, blockRecoveryId, true);
  // Cannot close file right now, since the last block requires recovery.
  // This may potentially cause infinite loop in lease recovery
  // if there are no valid replicas on data-nodes.
  NameNode.stateChangeLog.warn(
    "DIR* NameSystem.internalReleaseLease: " +
    "File " + src + " has not been closed." +
    " Lease recovery is in progress. " +
    "RecoveryId = " + blockRecoveryId + " for block " + lastBlock);
 }{code}
 

Here is the corresponding log.

//*The 1st recoverLease call. Failed because no blocks found.*

2019-08-29 13:34:31,788 [IPC Server handler 9 on default port 39637] INFO 
namenode.FSNamesystem (FSNamesystem.java:recoverLeaseInternal(2682)) - 
recoverLease: [Lease. Holder: DFSClient_8256078, pending creates: 13], 
src=/1kb-multiple-checksum-blocks-64-16 from client DFSClient_8256078
 2019-08-29 13:34:31,788 [IPC Server handler 9 on default port 39637] INFO 
namenode.FSNamesystem (FSNamesystem.java:internalReleaseLease(3424)) - 
Recovering [Lease. Holder: DFSClient_8256078, pending creates: 13], 
src=/1kb-multiple-checksum-blocks-64-16
 2019-08-29 13:34:31,789 [IPC Server handler 9 on default port 39637] WARN 
BlockStateChange 
(BlockUnderConstructionFeature.java:initializeBlockRecovery(238)) - BLOCK* 
BlockUnderConstructionFeature.initializeBlockRecovery: No blocks found, lease 
removed.
 2019-08-29 13:34:31,790 [IPC Server handler 9 on default port 39637] WARN 
hdfs.StateChange (FSNamesystem.java:internalReleaseLease(3550)) - DIR* 
NameSystem.internalReleaseLease: File /1kb-multiple-checksum-blocks-64-16 has 
not been closed. Lease recovery is in progress. RecoveryId = 1031 for block 
blk_7162739548153522810_1020

 

//*The 2st recoverLease call. Failed because the previous attempt doesn't 
timeout.*
 2019-08-29 13:34:32,795 [IPC Server handler 8 on default port 39637] INFO 
namenode.FSNamesystem (FSNamesystem.java:recoverLeaseInternal(2682)) - 
recoverLease: [Lease. Holder: DFSClient_NONMAPREDUCE_1352689944_1, pending 
creates: 1], src=/1kb-multiple-checksum-blocks-64-16 from client 
DFSClient_NONMAPREDUCE_1352689944_1
 2019-08-29 13:34:32,795 [IPC Server handler 8 on default port 39637] INFO 
namenode.FSNamesystem (FSNamesystem.java:internalReleaseLease(3424)) - 
Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_1352689944_1, pending 
creates: 1], src=/1kb-multiple-checksum-blocks-64-16
 2019-08-29 13:34:32,796 [IPC Server handler 8 on default port 39637] INFO 
blockmanagement.BlockManager (PendingRecoveryBlocks.java:add(80)) - Block 
recovery attempt for blk_7162739548153522810_1020 rejected, as the previous 
attempt times out in 88 seconds.

 

//*The 3st recoverLease call.*
 2019-08-29 13:34:33,798 [IPC Server handler 0 on default port 39637] INFO 
namenode.FSNamesystem (FSNamesystem.java:recoverLeaseInternal(2682)) - 
recoverLease: [Lease. Holder: DFSClient_NONMAPREDUCE_1352689944_1, pending 
creates: 1], src=/1kb-multiple-checksum-blocks-64-16 from client 
DFSClient_NONMAPREDUCE_1352689944_1
 2019-08-29 13:34:33,799 [IPC Server handler 0 on default port 39637] INFO 
namenode.FSNamesystem (FSNamesystem.java:internalReleaseLease(3424)) - 
Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_1352689944_1, pending 
creates: 1], src=/1kb-multiple-checksum-blocks-64-16
 2019-08-29 13:34:33,799 [IPC Server handler 0 on default port 39637] INFO 
blockmanagement.BlockManager (PendingRecoveryBlocks.java:add(80)) - Block 
recovery attempt for blk_7162739548153522810_1020 rejected, as the previous 
attempt times out in 87 seconds.

 

//*There are another 4st~10st recoverLease logs and are the same with the 2st 
and 3st. Ignore them.*


was (Author: lijinglun):
Hi [~jojochuang], I think the problem is caused by adding the recovery attemp 
before initializing the block recovery.
When the first recoverLease call arrives it will be added to 
pendingRecoveryBlocks by addBlockRecoveryAttempt().  If the NameNode has not 
got any block report for the last block, the initializeBLockRecovery() will 
fail.  So the block recovery doesn't start.
Then the later recoverLease calls arrive and are all rejected by 
addBlockRecoveryAttempt() until the first is timeout.
The failing unit test only retry 10 times so it failed.
{code:java}
if (blockManager.addBlockRecoveryAttempt(lastBlock)) {
  long blockRecoveryId = nextGenerationStamp(
  blockManager.isLegacyBlock(lastBlock));
  if(copyOnTruncate) {
    lastBlock.setGenerationStamp(blockRecoveryId);
  } else if(truncateRecovery) {
    recoveryBlock.setGenerationStamp(blockRecoveryId);
  }
  uc.initializeBlockRecovery(lastBlock, blockRecoveryId, true);
  // Cannot close file right now, since the last block requires recovery.
  // This may potentially cause infinite loop in lease recovery
  // if there are no valid replicas on data-nodes.
  NameNode.stateChangeLog.warn(
    "DIR* NameSystem.internalReleaseLease: " +
    "File " + src + " has not been closed." +
    " Lease recovery is in progress. " +
    "RecoveryId = " + blockRecoveryId + " for block " + lastBlock);
 }{code}
 

Here is the corresponding log.

## The 1st recoverLease call. Failed because no blocks found.
2019-08-29 13:34:31,788 [IPC Server handler 9 on default port 39637] INFO 
namenode.FSNamesystem (FSNamesystem.java:recoverLeaseInternal(2682)) - 
recoverLease: [Lease. Holder: DFSClient_8256078, pending creates: 13], 
src=/1kb-multiple-checksum-blocks-64-16 from client DFSClient_8256078
2019-08-29 13:34:31,788 [IPC Server handler 9 on default port 39637] INFO 
namenode.FSNamesystem (FSNamesystem.java:internalReleaseLease(3424)) - 
Recovering [Lease. Holder: DFSClient_8256078, pending creates: 13], 
src=/1kb-multiple-checksum-blocks-64-16
2019-08-29 13:34:31,789 [IPC Server handler 9 on default port 39637] WARN 
BlockStateChange 
(BlockUnderConstructionFeature.java:initializeBlockRecovery(238)) - BLOCK* 
BlockUnderConstructionFeature.initializeBlockRecovery: No blocks found, lease 
removed.
2019-08-29 13:34:31,790 [IPC Server handler 9 on default port 39637] WARN 
hdfs.StateChange (FSNamesystem.java:internalReleaseLease(3550)) - DIR* 
NameSystem.internalReleaseLease: File /1kb-multiple-checksum-blocks-64-16 has 
not been closed. Lease recovery is in progress. RecoveryId = 1031 for block 
blk_7162739548153522810_1020

## The 2st recoverLease call. Failed because the previous attempt doesn't 
timeout.
2019-08-29 13:34:32,795 [IPC Server handler 8 on default port 39637] INFO 
namenode.FSNamesystem (FSNamesystem.java:recoverLeaseInternal(2682)) - 
recoverLease: [Lease. Holder: DFSClient_NONMAPREDUCE_1352689944_1, pending 
creates: 1], src=/1kb-multiple-checksum-blocks-64-16 from client 
DFSClient_NONMAPREDUCE_1352689944_1
2019-08-29 13:34:32,795 [IPC Server handler 8 on default port 39637] INFO 
namenode.FSNamesystem (FSNamesystem.java:internalReleaseLease(3424)) - 
Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_1352689944_1, pending 
creates: 1], src=/1kb-multiple-checksum-blocks-64-16
2019-08-29 13:34:32,796 [IPC Server handler 8 on default port 39637] INFO 
blockmanagement.BlockManager (PendingRecoveryBlocks.java:add(80)) - Block 
recovery attempt for blk_7162739548153522810_1020 rejected, as the previous 
attempt times out in 88 seconds.

## The 3st recoverLease call.
2019-08-29 13:34:33,798 [IPC Server handler 0 on default port 39637] INFO 
namenode.FSNamesystem (FSNamesystem.java:recoverLeaseInternal(2682)) - 
recoverLease: [Lease. Holder: DFSClient_NONMAPREDUCE_1352689944_1, pending 
creates: 1], src=/1kb-multiple-checksum-blocks-64-16 from client 
DFSClient_NONMAPREDUCE_1352689944_1
2019-08-29 13:34:33,799 [IPC Server handler 0 on default port 39637] INFO 
namenode.FSNamesystem (FSNamesystem.java:internalReleaseLease(3424)) - 
Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_1352689944_1, pending 
creates: 1], src=/1kb-multiple-checksum-blocks-64-16
2019-08-29 13:34:33,799 [IPC Server handler 0 on default port 39637] INFO 
blockmanagement.BlockManager (PendingRecoveryBlocks.java:add(80)) - Block 
recovery attempt for blk_7162739548153522810_1020 rejected, as the previous 
attempt times out in 87 seconds.

## There are another 4st~10st recoverLease logs and are the same with the 2st 
and 3st. Ignore them.

> Data race between block report and recoverLease()
> -------------------------------------------------
>
>                 Key: HDFS-14800
>                 URL: https://issues.apache.org/jira/browse/HDFS-14800
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Wei-Chiu Chuang
>            Priority: Major
>         Attachments: PreCommit-HDFS-Build #27717 test - 
> testUpgradeFromRel1BBWImage [Jenkins].pdf
>
>
> I thought I fixed it in HDFS-10240, but I am seeing a similar race condition 
> in a precommit test again.
> https://builds.apache.org/job/PreCommit-HDFS-Build/27717/testReport/org.apache.hadoop.hdfs/TestDFSUpgradeFromImage/testUpgradeFromRel1BBWImage/
> {noformat}
> 2019-08-29 13:34:31,788 [IPC Server handler 9 on default port 39637] INFO  
> namenode.FSNamesystem (FSNamesystem.java:recoverLeaseInternal(2682)) - 
> recoverLease: [Lease.  Holder: DFSClient_8256078, pending creates: 13], 
> src=/1kb-multiple-checksum-blocks-64-16 from client DFSClient_8256078
> 2019-08-29 13:34:31,788 [IPC Server handler 9 on default port 39637] INFO  
> namenode.FSNamesystem (FSNamesystem.java:internalReleaseLease(3424)) - 
> Recovering [Lease.  Holder: DFSClient_8256078, pending creates: 13], 
> src=/1kb-multiple-checksum-blocks-64-16
> 2019-08-29 13:34:31,789 [IPC Server handler 9 on default port 39637] WARN  
> BlockStateChange 
> (BlockUnderConstructionFeature.java:initializeBlockRecovery(238)) - BLOCK* 
> BlockUnderConstructionFeature.initializeBlockRecovery: No blocks found, lease 
> removed.
> 2019-08-29 13:34:31,790 [IPC Server handler 9 on default port 39637] WARN  
> hdfs.StateChange (FSNamesystem.java:internalReleaseLease(3550)) - DIR* 
> NameSystem.internalReleaseLease: File /1kb-multiple-checksum-blocks-64-16 has 
> not been closed. Lease recovery is in progress. RecoveryId = 1031 for block 
> blk_7162739548153522810_1020
> 2019-08-29 13:34:31,791 [Block report processor] INFO  BlockStateChange 
> (BlockManager.java:processReport(2645)) - BLOCK* processReport 
> 0x8f0bcadff51597e8: Processing first storage report for 
> DS-dd914776-8c9f-4f0d-9bd3-91a552bdc351 from datanode 
> 187b2e09-75e8-4bd0-ab46-90e32839a21d
> 2019-08-29 13:34:31,793 [Block report processor] INFO  BlockStateChange 
> (BlockManager.java:processReport(2674)) - BLOCK* processReport 
> 0x8f0bcadff51597e8: from storage DS-dd914776-8c9f-4f0d-9bd3-91a552bdc351 node 
> DatanodeRegistration(127.0.0.1:32987, 
> datanodeUuid=187b2e09-75e8-4bd0-ab46-90e32839a21d, infoPort=42569, 
> infoSecurePort=0, ipcPort=39147, 
> storageInfo=lv=-57;cid=testClusterId;nsid=889473900;c=1567085670457), blocks: 
> 13, hasStaleStorage: false, processing time: 2 msecs, invalidatedBlocks: 0
> 2019-08-29 13:34:31,808 [BP-268013932-172.17.0.2-1567085670423 heartbeating 
> to localhost/127.0.0.1:39637] INFO  datanode.DataNode 
> (BPServiceActor.java:blockReport(432)) - Successfully sent block report 
> 0x8f0bcadff51597e8,  containing 1 storage report(s), of which we sent 1. The 
> reports had 13 total blocks and used 1 RPC(s). This took 4 msecs to generate 
> and 33 msecs for RPC and NN processing. Got back no commands.
> 2019-08-29 13:34:32,795 [IPC Server handler 8 on default port 39637] INFO  
> namenode.FSNamesystem (FSNamesystem.java:recoverLeaseInternal(2682)) - 
> recoverLease: [Lease.  Holder: DFSClient_NONMAPREDUCE_1352689944_1, pending 
> creates: 1], src=/1kb-multiple-checksum-blocks-64-16 from client 
> DFSClient_NONMAPREDUCE_1352689944_1
> 2019-08-29 13:34:32,795 [IPC Server handler 8 on default port 39637] INFO  
> namenode.FSNamesystem (FSNamesystem.java:internalReleaseLease(3424)) - 
> Recovering [Lease.  Holder: DFSClient_NONMAPREDUCE_1352689944_1, pending 
> creates: 1], src=/1kb-multiple-checksum-blocks-64-16
> 2019-08-29 13:34:32,796 [IPC Server handler 8 on default port 39637] INFO  
> blockmanagement.BlockManager (PendingRecoveryBlocks.java:add(80)) - Block 
> recovery attempt for blk_7162739548153522810_1020 rejected, as the previous 
> attempt times out in 88 seconds.
> {noformat}
> Looks like if client calls recoverLease on a file before NameNode receives a 
> block report from DataNodes, it will fail to recover the lease, and the file 
> remains unclosed.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to