hfutatzhanghb opened a new pull request, #6509:
URL: https://github.com/apache/hadoop/pull/6509

   ### Description of PR
   Refer to HDFS-17358.
   
   Recently, a strange case occurred on our EC production cluster.
   
   The phenomenon is as follows: the NameNode keeps retrying lease recovery 
for some EC files (~80K+) indefinitely, and those files can never be closed.
   
   After digging into the logs and related code, we found that the root cause 
is the following code in `BlockRecoveryWorker$RecoveryTaskStriped#recover`:
   
   ```java
          // we hit info.getNumBytes() == 0 here!
             if (info != null &&
                 info.getGenerationStamp() >= block.getGenerationStamp() &&
                 info.getNumBytes() > 0) {
               final BlockRecord existing = syncBlocks.get(blockId);
               if (existing == null ||
                   info.getNumBytes() > existing.rInfo.getNumBytes()) {
                 // if we have >1 replicas for the same internal block, we
                 // simply choose the one with larger length.
                 // TODO: better usage of redundant replicas
                 syncBlocks.put(blockId, new BlockRecord(id, proxyDN, info));
               }
             }
   
          // the exception is thrown here!
             checkLocations(syncBlocks.size());
   
   ```
   
   The related logs are as follows:
   
   >java.io.IOException: 
BP-1157541496-10.104.10.198-1702548776421:blk_-9223372036808032688_2938828 has 
no enough internal blocks, unable to start recovery. Locations=[...] 
   
   >2024-01-23 12:48:16,171 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
initReplicaRecovery: blk_-9223372036808032686_2938828, recoveryId=27615365, 
replica=ReplicaUnderRecovery, blk_-9223372036808032686_2938828, RUR 
getNumBytes() = 0 getBytesOnDisk() = 0 getVisibleLength()= -1 getVolume() = 
/data25/hadoop/hdfs/datanode getBlockURI() = 
file:/data25/hadoop/hdfs/datanode/current/BP-1157541496-x.x.x.x-1702548776421/current/rbw/blk_-9223372036808032686
 recoveryId=27529675 original=ReplicaWaitingToBeRecovered, 
blk_-9223372036808032686_2938828, RWR getNumBytes() = 0 getBytesOnDisk() = 0 
getVisibleLength()= -1 getVolume() = /data25/hadoop/hdfs/datanode getBlockURI() 
= 
file:/data25/hadoop/hdfs/datanode/current/BP-1157541496-10.104.10.198-1702548776421/current/rbw/blk_-9223372036808032686
   
   Because the length of the RWR replica is zero, the `ReplicaRecoveryInfo` 
returned by the call below also has a length of zero, so it is never put into 
`syncBlocks`, and `checkLocations` then throws the exception above.
   
   ```java
   ReplicaRecoveryInfo info = callInitReplicaRecovery(proxyDN,
       new RecoveringBlock(internalBlk, null, recoveryId));
   ```
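
   To make the failure mode concrete, here is a minimal, self-contained sketch 
(not the actual Hadoop classes; `selectUsable`, `checkLocations`, and the 
`minLocations` value are simplified stand-ins) modeling how the recovery loop 
filters candidate replicas. Any replica reporting `getNumBytes() == 0`, such as 
the RWR replicas in the logs above, is skipped, so `syncBlocks` can end up with 
fewer entries than the required number of internal blocks and the 
`checkLocations`-style check throws, restarting the recovery forever:

   ```java
   import java.io.IOException;
   import java.util.HashMap;
   import java.util.Map;

   public class RecoverySketch {

     /**
      * Keep only replicas with a positive length, keyed by internal block id.
      * Each input row is {blockId, numBytes}. Mirrors the condition
      * info != null && genStamp ok && info.getNumBytes() > 0, and the
      * "choose the one with larger length" rule for duplicate block ids.
      */
     static Map<Long, Long> selectUsable(long[][] replicas) {
       Map<Long, Long> syncBlocks = new HashMap<>();
       for (long[] replica : replicas) {
         long blockId = replica[0];
         long numBytes = replica[1];
         if (numBytes > 0) {
           syncBlocks.merge(blockId, numBytes, Math::max);
         }
       }
       return syncBlocks;
     }

     /** Mirrors checkLocations: fail when too few usable internal blocks. */
     static void checkLocations(int usable, int minLocations)
         throws IOException {
       if (usable < minLocations) {
         throw new IOException("has no enough internal blocks, "
             + "unable to start recovery");
       }
     }

     public static void main(String[] args) {
       // All RWR replicas report length 0 -> nothing is usable.
       Map<Long, Long> syncBlocks =
           selectUsable(new long[][] {{1, 0}, {2, 0}, {3, 0}});
       System.out.println("usable=" + syncBlocks.size());
       try {
         // e.g. an RS-6-3 group needs at least 6 data blocks
         checkLocations(syncBlocks.size(), 6);
       } catch (IOException e) {
         System.out.println("recovery aborted: " + e.getMessage());
       }
     }
   }
   ```

   Since the zero-length replicas are filtered out before `checkLocations` 
runs, every retry of the lease recovery hits the same exception, which matches 
the infinite-recovery symptom described above.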

