[ 
https://issues.apache.org/jira/browse/HDFS-17358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Li resolved HDFS-17358.
---------------------------
    Fix Version/s: 3.5.0
       Resolution: Fixed

> EC: infinite lease recovery caused by the length of RWR equals to zero.
> -----------------------------------------------------------------------
>
>                 Key: HDFS-17358
>                 URL: https://issues.apache.org/jira/browse/HDFS-17358
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec
>            Reporter: farmmamba
>            Assignee: farmmamba
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
>
> Recently, we hit a strange case on our EC production cluster.
> The symptom is as follows: the NameNode retries lease recovery indefinitely
> for some EC files (~80K+), and those files can never be closed.
>  
> After digging into the logs and the related code, we found that the root
> cause is the following code in the method `BlockRecoveryWorker$RecoveryTaskStriped#recover`:
> {code:java}
>           // In our case info.getNumBytes() == 0, so this condition is false.
>           if (info != null &&
>               info.getGenerationStamp() >= block.getGenerationStamp() &&
>               info.getNumBytes() > 0) {
>             final BlockRecord existing = syncBlocks.get(blockId);
>             if (existing == null ||
>                 info.getNumBytes() > existing.rInfo.getNumBytes()) {
>               // if we have >1 replicas for the same internal block, we
>               // simply choose the one with larger length.
>               // TODO: better usage of redundant replicas
>               syncBlocks.put(blockId, new BlockRecord(id, proxyDN, info));
>             }
>           }
>           // The exception is thrown here.
>           checkLocations(syncBlocks.size());
> {code}
> The related logs are as below:
> {code:java}
> java.io.IOException: 
> BP-1157541496-10.104.10.198-1702548776421:blk_-9223372036808032688_2938828 
> has no enough internal blocks, unable to start recovery. Locations=[...] 
> {code}
> {code:java}
> 2024-01-23 12:48:16,171 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> initReplicaRecovery: blk_-9223372036808032686_2938828, recoveryId=27615365, 
> replica=ReplicaUnderRecovery, blk_-9223372036808032686_2938828, RUR 
> getNumBytes() = 0 getBytesOnDisk() = 0 getVisibleLength()= -1 getVolume() = 
> /data25/hadoop/hdfs/datanode getBlockURI() = 
> file:/data25/hadoop/hdfs/datanode/current/BP-1157541496-x.x.x.x-1702548776421/current/rbw/blk_-9223372036808032686
>  recoveryId=27529675 original=ReplicaWaitingToBeRecovered, 
> blk_-9223372036808032686_2938828, RWR getNumBytes() = 0 getBytesOnDisk() = 0 
> getVisibleLength()= -1 getVolume() = /data25/hadoop/hdfs/datanode 
> getBlockURI() = 
> file:/data25/hadoop/hdfs/datanode/current/BP-1157541496-10.104.10.198-1702548776421/current/rbw/blk_-9223372036808032686
> {code}
> Because the length of the RWR replica is zero, the ReplicaRecoveryInfo
> returned by the code below also has a length of zero, so it cannot be put
> into syncBlocks. As a result, the checkLocations method throws the exception.
> {code:java}
>           ReplicaRecoveryInfo info = callInitReplicaRecovery(proxyDN,
>               new RecoveringBlock(internalBlk, null, recoveryId));
> {code}
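
The failure mode described in the report can be reproduced in a small, self-contained model. This is only a sketch, not the Hadoop code: `RecoverySketch`, `Replica`, `collectSyncBlocks`, and `hasEnoughLocations` are hypothetical names, the filter copies the condition quoted in the report, and the threshold check assumes that recovery needs at least as many usable internal blocks as the EC policy has data units (6 here, assuming an RS-6-3 policy, which the report does not state). The scenario in `main`, where every replica reports a length of zero, is also an assumption chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RecoverySketch {

    /** Hypothetical stand-in for the replica info returned by a DataNode. */
    static final class Replica {
        final long blockId;
        final long generationStamp;
        final long numBytes;

        Replica(long blockId, long generationStamp, long numBytes) {
            this.blockId = blockId;
            this.generationStamp = generationStamp;
            this.numBytes = numBytes;
        }
    }

    /**
     * Mirrors the quoted condition: a replica is kept only if its generation
     * stamp is recent enough AND its length is strictly positive. For
     * duplicates of the same internal block, the longer replica wins.
     */
    static Map<Long, Replica> collectSyncBlocks(List<Replica> replies,
                                                long blockGenerationStamp) {
        Map<Long, Replica> syncBlocks = new HashMap<>();
        for (Replica info : replies) {
            if (info != null
                    && info.generationStamp >= blockGenerationStamp
                    && info.numBytes > 0) {
                Replica existing = syncBlocks.get(info.blockId);
                if (existing == null || info.numBytes > existing.numBytes) {
                    syncBlocks.put(info.blockId, info);
                }
            }
        }
        return syncBlocks;
    }

    /** Assumed threshold: at least one usable internal block per data unit. */
    static boolean hasEnoughLocations(int usableBlocks, int dataUnits) {
        return usableBlocks >= dataUnits;
    }

    public static void main(String[] args) {
        long blockGenerationStamp = 2938828L; // generation stamp seen in the logs
        int dataUnits = 6;                    // assumption: RS-6-3 policy

        // Assumed scenario: all nine internal blocks report a length of zero.
        List<Replica> replies = new ArrayList<>();
        for (long id = 0; id < 9; id++) {
            replies.add(new Replica(id, blockGenerationStamp, 0L));
        }

        Map<Long, Replica> syncBlocks =
                collectSyncBlocks(replies, blockGenerationStamp);
        System.out.println("usable internal blocks: " + syncBlocks.size());
        System.out.println("recovery can start: "
                + hasEnoughLocations(syncBlocks.size(), dataUnits));
    }
}
```

Under these assumptions, `main` reports zero usable internal blocks and `false` for whether recovery can start. Because the replica lengths do not change between attempts, each retry would be rejected for the same reason, which is consistent with the endless lease recovery described in the report.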



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
