Yiqun Lin created HDFS-17604:
--------------------------------
Summary: EC block deletion under snapshot makes NameNode crashed
Key: HDFS-17604
URL: https://issues.apache.org/jira/browse/HDFS-17604
Project: Hadoop HDFS
Issue Type: Bug
Components: ec, erasure-coding
Affects Versions: 3.3.3
Environment: We hit a corner case in which deleting an EC block under an
HDFS snapshot can crash the NameNode.
The stack trace of the error:
{noformat}
2024-07-10 23:17:47,665 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation DeleteOp [length=0, path=xxxx, timestamp=1720678635100, RpcClientId=5161c587-9102-41cf-b823-fe618db9ab4c, RpcCallId=177, opCode=OP_DELETE, txid=55577688248]
java.lang.IllegalStateException
    at org.apache.hadoop.thirdparty.com.google.common.base.Preconditions.checkState(Preconditions.java:494)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.collectBlocksBeyondSnapshot(INodeFile.java:1225)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.FileWithSnapshotFeature.collectBlocksAndClear(FileWithSnapshotFeature.java:240)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.FileWithSnapshotFeature.cleanFile(FileWithSnapshotFeature.java:134)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.cleanSubtree(INodeFile.java:754)
    at org.apache.hadoop.hdfs.server.namenode.INodeReference$DstReference.destroyAndCollectBlocks(INodeReference.java:714)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.destroyCreatedList(DirectoryWithSnapshotFeature.java:75)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.access$800(DirectoryWithSnapshotFeature.java:48)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.destroyDstSubtree(DirectoryWithSnapshotFeature.java:423)
    at org.apache.hadoop.hdfs.server.namenode.INodeReference$DstReference.destroyAndCollectBlocks(INodeReference.java:720)
    at org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.unprotectedDelete(FSDirDeleteOp.java:258)
    at org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.deleteForEditLog(FSDirDeleteOp.java:143)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:630)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:288)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:183)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:915)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:364)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:505)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$400(EditLogTailer.java:451)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:468)
{noformat}
The cause is an assumption in the code: because EC files do not support
truncate, EC block deletion is expected never to reach the truncate-handling
logic shown below.
*INodeFile#collectBlocksBeyondSnapshot*
{noformat}
/**
 * This function is only called when block list is stored in snapshot
 * diffs. Note that this can only happen when truncation happens with
 * snapshots. Since we do not support truncation with striped blocks,
 * we only need to handle contiguous blocks here.
 */
public void collectBlocksBeyondSnapshot(BlockInfo[] snapshotBlocks,
    BlocksMapUpdateInfo collectedBlocks) {
  Preconditions.checkState(!isStriped()); // <=== the exception is thrown here
  BlockInfo[] oldBlocks = getBlocks();
  if(snapshotBlocks == null || oldBlocks == null)
    return;
  ...
  }
}
{noformat}
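To make the failure mode concrete, here is a minimal, self-contained sketch of the precondition. It is not HDFS code: StripedDeleteSketch, ToyFile and deleteFails are hypothetical names, and the collection logic after the check is a placeholder, because the real method body is elided above. It only illustrates that any striped file reaching this method throws IllegalStateException, whatever its block list looks like.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only; none of these types are the real HDFS classes. */
class StripedDeleteSketch {

  /** Hypothetical stand-in for INodeFile: a block list plus a striped (EC) flag. */
  static final class ToyFile {
    final boolean striped;
    final List<String> blocks;

    ToyFile(boolean striped, List<String> blocks) {
      this.striped = striped;
      this.blocks = new ArrayList<>(blocks);
    }

    /** Same guard as the real method; the rest is a simplified placeholder. */
    List<String> collectBlocksBeyondSnapshot(List<String> snapshotBlocks) {
      if (striped) {
        // Equivalent in effect to Preconditions.checkState(!isStriped()).
        throw new IllegalStateException();
      }
      List<String> collected = new ArrayList<>();
      if (snapshotBlocks == null) {
        return collected;
      }
      // Placeholder (assumption): collect blocks the snapshot does not reference.
      for (String block : blocks) {
        if (!snapshotBlocks.contains(block)) {
          collected.add(block);
        }
      }
      return collected;
    }
  }

  /** Returns true if block collection throws for the given file. */
  static boolean deleteFails(ToyFile file, List<String> snapshotBlocks) {
    try {
      file.collectBlocksBeyondSnapshot(snapshotBlocks);
      return false;
    } catch (IllegalStateException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    List<String> snapshot = Arrays.asList("blk_1");
    ToyFile contiguous = new ToyFile(false, Arrays.asList("blk_1", "blk_2"));
    ToyFile stripedFile = new ToyFile(true, Arrays.asList("blk_1", "blk_2"));
    System.out.println("contiguous fails: " + deleteFails(contiguous, snapshot));
    System.out.println("striped fails: " + deleteFails(stripedFile, snapshot));
  }
}
```

In this toy model the contiguous file is processed normally while the striped file always fails, which matches the stack trace: once a striped file has a block list recorded in a snapshot diff, the delete path cannot complete.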
However, there is a special case in which EC block deletion under a snapshot
does reach this code. The issue can be reproduced with the following steps:
1) Create an EC folder with HDFS snapshots enabled, and start a DistCp job
that copies data into this folder.
2) While the EC data is being written, create a new snapshot.
3) Kill the running DistCp job submitted in step 1.
4) Delete the incomplete EC file that was being copied in the steps above. The
Standby NN will fail with the error shown above.
Reporter: Yiqun Lin