Yiqun Lin created HDFS-17604:
--------------------------------

             Summary: EC block deletion under snapshot makes NameNode crashed
                 Key: HDFS-17604
                 URL: https://issues.apache.org/jira/browse/HDFS-17604
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: ec, erasure-coding
    Affects Versions: 3.3.3
         Environment: We hit a corner case in which deleting an EC block under an HDFS snapshot can crash the NameNode.
The stack trace:
{noformat}
2024-07-10 23:17:47,665 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation DeleteOp [length=0, path=xxxx, timestamp=1720678635100, RpcClientId=5161c587-9102-41cf-b823-fe618db9ab4c, RpcCallId=177, opCode=OP_DELETE, txid=55577688248]
java.lang.IllegalStateException
        at org.apache.hadoop.thirdparty.com.google.common.base.Preconditions.checkState(Preconditions.java:494)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.collectBlocksBeyondSnapshot(INodeFile.java:1225)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.FileWithSnapshotFeature.collectBlocksAndClear(FileWithSnapshotFeature.java:240)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.FileWithSnapshotFeature.cleanFile(FileWithSnapshotFeature.java:134)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.cleanSubtree(INodeFile.java:754)
        at org.apache.hadoop.hdfs.server.namenode.INodeReference$DstReference.destroyAndCollectBlocks(INodeReference.java:714)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.destroyCreatedList(DirectoryWithSnapshotFeature.java:75)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.access$800(DirectoryWithSnapshotFeature.java:48)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.destroyDstSubtree(DirectoryWithSnapshotFeature.java:423)
        at org.apache.hadoop.hdfs.server.namenode.INodeReference$DstReference.destroyAndCollectBlocks(INodeReference.java:720)
        at org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.unprotectedDelete(FSDirDeleteOp.java:258)
        at org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.deleteForEditLog(FSDirDeleteOp.java:143)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:630)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:288)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:183)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:915)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:364)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:505)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$400(EditLogTailer.java:451)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:468)
{noformat}

The cause is that the code assumes EC block deletion never reaches the truncate logic below, since EC files do not support truncate.

*INodeFile#collectBlocksBeyondSnapshot*
{noformat}
  /**
   * This function is only called when block list is stored in snapshot
   * diffs. Note that this can only happen when truncation happens with
   * snapshots. Since we do not support truncation with striped blocks,
   * we only need to handle contiguous blocks here.
   */
  public void collectBlocksBeyondSnapshot(BlockInfo[] snapshotBlocks,
      BlocksMapUpdateInfo collectedBlocks) {
    Preconditions.checkState(!isStriped());   // <=== exception thrown here
    BlockInfo[] oldBlocks = getBlocks();
    if(snapshotBlocks == null || oldBlocks == null)
      return;
    ...
  }
}
{noformat}

However, there is a special case in which EC block deletion under a snapshot does reach this code. The issue can be reproduced with the following steps:
1) Create an EC folder with HDFS snapshots enabled, and start a DistCp job that copies data into this folder.
2) While the EC data is being written, create a new snapshot.
3) Kill the running DistCp job submitted in step 1.
4) Delete the broken EC file that was copied in the steps above.

The Standby NN will then fail with the error above.
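To illustrate the failure mode, here is a minimal, self-contained Java sketch of the guard. It is not HDFS code: the class StripedSnapshotGuardSketch, the nested FileModel type, and the use of plain long values as block IDs are all hypothetical stand-ins, and the block collection is a simplified model of collecting the blocks that a snapshot no longer shares. It only shows that once a striped (EC) file has a block list recorded in a snapshot diff, the precondition throws IllegalStateException instead of collecting blocks.

```java
import java.util.ArrayList;
import java.util.List;

public class StripedSnapshotGuardSketch {

    /** Hypothetical stand-in for INodeFile: only the striped flag and block IDs are modelled. */
    static final class FileModel {
        final boolean striped;
        final long[] blocks;

        FileModel(boolean striped, long... blocks) {
            this.striped = striped;
            this.blocks = blocks;
        }
    }

    /**
     * Simplified model of the guarded method quoted in this report.
     * Returns the block IDs of the file that lie beyond the prefix shared with the snapshot.
     */
    static List<Long> collectBlocksBeyondSnapshot(FileModel file, long[] snapshotBlocks) {
        // Same assumption as the quoted code: only contiguous files reach this path.
        if (file.striped) {
            throw new IllegalStateException("striped file reached truncate-only path");
        }
        List<Long> collected = new ArrayList<>();
        if (snapshotBlocks == null || file.blocks == null) {
            return collected;
        }
        // Skip the prefix of blocks the snapshot still shares with the file.
        int n = 0;
        while (n < file.blocks.length && n < snapshotBlocks.length
                && file.blocks[n] == snapshotBlocks[n]) {
            n++;
        }
        // Everything after the shared prefix is collected for deletion.
        for (; n < file.blocks.length; n++) {
            collected.add(file.blocks[n]);
        }
        return collected;
    }

    public static void main(String[] args) {
        // Contiguous file: the guard passes and the trailing blocks are collected.
        FileModel contiguous = new FileModel(false, 1L, 2L, 3L);
        System.out.println("collected = " + collectBlocksBeyondSnapshot(contiguous, new long[] {1L}));

        // Striped file whose block list was captured by a snapshot taken mid-write:
        // the guard throws, which is what happens while the edit log is being applied.
        FileModel stripedFile = new FileModel(true, 10L, 11L);
        try {
            collectBlocksBeyondSnapshot(stripedFile, new long[] {10L});
        } catch (IllegalStateException e) {
            System.out.println("guard failed: " + e.getMessage());
        }
    }
}
```

In the real NameNode the exception surfaces while the Standby NN tails the edit log, so the failure shows up on replay of the DeleteOp rather than in the client that issued the delete.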
            Reporter: Yiqun Lin