hfutatzhanghb opened a new pull request, #6509:
URL: https://github.com/apache/hadoop/pull/6509
### Description of PR
Refer to HDFS-17358.
Recently, we hit a strange case on our EC (erasure coding) production cluster.
The symptom is as follows: the NameNode retries lease recovery indefinitely
for some EC files (~80K+), and those files can never be closed.
After digging into the logs and related code, we found that the root cause is
the following code in the method `BlockRecoveryWorker$RecoveryTaskStriped#recover`:
```java
// we met info.getNumBytes==0 here!
if (info != null &&
    info.getGenerationStamp() >= block.getGenerationStamp() &&
    info.getNumBytes() > 0) {
  final BlockRecord existing = syncBlocks.get(blockId);
  if (existing == null ||
      info.getNumBytes() > existing.rInfo.getNumBytes()) {
    // if we have >1 replicas for the same internal block, we
    // simply choose the one with larger length.
    // TODO: better usage of redundant replicas
    syncBlocks.put(blockId, new BlockRecord(id, proxyDN, info));
  }
}
// throw exception here!
checkLocations(syncBlocks.size());
```
The related logs are shown below:
>java.io.IOException:
BP-1157541496-10.104.10.198-1702548776421:blk_-9223372036808032688_2938828 has
no enough internal blocks, unable to start recovery. Locations=[...]
>2024-01-23 12:48:16,171 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
initReplicaRecovery: blk_-9223372036808032686_2938828, recoveryId=27615365,
replica=ReplicaUnderRecovery, blk_-9223372036808032686_2938828, RUR
getNumBytes() = 0 getBytesOnDisk() = 0 getVisibleLength()= -1 getVolume() =
/data25/hadoop/hdfs/datanode getBlockURI() =
file:/data25/hadoop/hdfs/datanode/current/BP-1157541496-x.x.x.x-1702548776421/current/rbw/blk_-9223372036808032686
recoveryId=27529675 original=ReplicaWaitingToBeRecovered,
blk_-9223372036808032686_2938828, RWR getNumBytes() = 0 getBytesOnDisk() = 0
getVisibleLength()= -1 getVolume() = /data25/hadoop/hdfs/datanode getBlockURI()
=
file:/data25/hadoop/hdfs/datanode/current/BP-1157541496-10.104.10.198-1702548776421/current/rbw/blk_-9223372036808032686
Because the length of the RWR (ReplicaWaitingToBeRecovered) replica is zero, the
`ReplicaRecoveryInfo` object returned by the code below also has a length of zero,
so it cannot be put into `syncBlocks`.
As a result, the `checkLocations` method throws the exception.
>ReplicaRecoveryInfo info = callInitReplicaRecovery(proxyDN,
new RecoveringBlock(internalBlk, null, recoveryId));
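
To make the failure mode easier to follow, here is a minimal standalone sketch of the filtering logic described above. It is not Hadoop code: `StripedRecoverySketch`, `ReplicaInfo`, `selectSyncBlocks`, and the constant `NUM_DATA_UNITS = 6` are hypothetical names and values chosen for illustration (the value assumes an RS-6-3 style policy, and the check assumes that recovery needs at least as many usable internal blocks as data units). It only shows that when every replica reports a length of zero, nothing is added to the map and the location check fails.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the striped recovery filter; not Hadoop code.
class StripedRecoverySketch {

  // Hypothetical reduction of ReplicaRecoveryInfo to the fields the filter reads.
  static final class ReplicaInfo {
    final long blockId;
    final long generationStamp;
    final long numBytes;

    ReplicaInfo(long blockId, long generationStamp, long numBytes) {
      this.blockId = blockId;
      this.generationStamp = generationStamp;
      this.numBytes = numBytes;
    }
  }

  // Assumption: an RS-6-3 style policy, so 6 data units are required.
  static final int NUM_DATA_UNITS = 6;

  // Mirrors the condition quoted above: a replica is kept only if its
  // generation stamp is new enough AND its length is greater than zero.
  static Map<Long, ReplicaInfo> selectSyncBlocks(List<ReplicaInfo> replies,
                                                 long blockGenStamp) {
    Map<Long, ReplicaInfo> syncBlocks = new HashMap<>();
    for (ReplicaInfo info : replies) {
      if (info != null
          && info.generationStamp >= blockGenStamp
          && info.numBytes > 0) {
        ReplicaInfo existing = syncBlocks.get(info.blockId);
        // Among duplicates of the same internal block, keep the longer one.
        if (existing == null || info.numBytes > existing.numBytes) {
          syncBlocks.put(info.blockId, info);
        }
      }
    }
    return syncBlocks;
  }

  // Assumed shape of the check: fail when fewer usable replicas than data units.
  static void checkLocations(int locationCount) throws IOException {
    if (locationCount < NUM_DATA_UNITS) {
      throw new IOException("not enough internal blocks: "
          + locationCount + " < " + NUM_DATA_UNITS);
    }
  }

  public static void main(String[] args) {
    // Nine internal blocks, all zero-length, as in the RWR replicas in the logs.
    List<ReplicaInfo> replies = new ArrayList<>();
    for (int i = 0; i < 9; i++) {
      replies.add(new ReplicaInfo(i, 2938828L, 0L));
    }
    Map<Long, ReplicaInfo> syncBlocks = selectSyncBlocks(replies, 2938828L);
    try {
      checkLocations(syncBlocks.size());
      System.out.println("recovery can start");
    } catch (IOException e) {
      // Every retry takes this branch, so the file is never closed.
      System.out.println("recovery failed: " + e.getMessage());
    }
  }
}
```

In this sketch the generation stamps all match, so the `numBytes > 0` condition alone empties the candidate set. Each retry issued by the NameNode would take the same path, which is consistent with the endless lease recovery we observed.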
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]