[ 
https://issues.apache.org/jira/browse/HDFS-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-17604:
-----------------------------
    Description: 
We hit a corner case in which deleting EC (erasure-coded) blocks under an HDFS 
snapshot can crash the NameNode.

The stack trace:
{noformat}
2024-07-10 23:17:47,665 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation DeleteOp [length=0, path=xxxx, timestamp=1720678635100, RpcClientId=5161c587-9102-41cf-b823-fe618db9ab4c, RpcCallId=177, opCode=OP_DELETE, txid=55577688248]
java.lang.IllegalStateException
        at org.apache.hadoop.thirdparty.com.google.common.base.Preconditions.checkState(Preconditions.java:494)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.collectBlocksBeyondSnapshot(INodeFile.java:1225)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.FileWithSnapshotFeature.collectBlocksAndClear(FileWithSnapshotFeature.java:240)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.FileWithSnapshotFeature.cleanFile(FileWithSnapshotFeature.java:134)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.cleanSubtree(INodeFile.java:754)
        at org.apache.hadoop.hdfs.server.namenode.INodeReference$DstReference.destroyAndCollectBlocks(INodeReference.java:714)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.destroyCreatedList(DirectoryWithSnapshotFeature.java:75)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.access$800(DirectoryWithSnapshotFeature.java:48)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.destroyDstSubtree(DirectoryWithSnapshotFeature.java:423)
        at org.apache.hadoop.hdfs.server.namenode.INodeReference$DstReference.destroyAndCollectBlocks(INodeReference.java:720)
        at org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.unprotectedDelete(FSDirDeleteOp.java:258)
        at org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.deleteForEditLog(FSDirDeleteOp.java:143)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:630)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:288)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:183)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:915)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:364)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:505)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$400(EditLogTailer.java:451)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:468)
{noformat}
The cause is that the code assumes EC block deletion never reaches the truncate 
logic shown below, because EC files do not support truncate.

*INodeFile#collectBlocksBeyondSnapshot*
{noformat}
/**
 * This function is only called when block list is stored in snapshot
 * diffs. Note that this can only happen when truncation happens with
 * snapshots. Since we do not support truncation with striped blocks,
 * we only need to handle contiguous blocks here.
 */
public void collectBlocksBeyondSnapshot(BlockInfo[] snapshotBlocks,
                                        BlocksMapUpdateInfo collectedBlocks) {
  Preconditions.checkState(!isStriped());   // <=== exception is thrown here
  BlockInfo[] oldBlocks = getBlocks();
  if(snapshotBlocks == null || oldBlocks == null)
    return;
  ...
  }
}
{noformat}
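To make the failure mode concrete, here is a minimal, self-contained sketch of the guard. `ToyINodeFile` and `StripedGuardDemo` are hypothetical stand-ins, not Hadoop classes. The local `checkState` mimics Guava's `Preconditions.checkState(boolean)`, which throws an `IllegalStateException` without a message, matching the stack trace above. The loop after the guard is a simplified placeholder for the elided logic, not the real implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class StripedGuardDemo {
  /** Returns true if the cleanup call threw IllegalStateException. */
  static boolean cleanupThrows(ToyINodeFile file, long[] snapshotBlocks) {
    try {
      file.collectBlocksBeyondSnapshot(snapshotBlocks);
      return false;
    } catch (IllegalStateException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    long[] snapshotBlocks = {1L};
    ToyINodeFile contiguous = new ToyINodeFile(false, 1L, 2L);
    ToyINodeFile striped = new ToyINodeFile(true, 1L, 2L);
    System.out.println("contiguous throws: " + cleanupThrows(contiguous, snapshotBlocks));
    System.out.println("striped throws:    " + cleanupThrows(striped, snapshotBlocks));
  }
}

/** Toy stand-in for INodeFile (hypothetical); blocks are plain IDs. */
final class ToyINodeFile {
  private final boolean striped;  // true models an erasure-coded (EC) file
  private final long[] blocks;    // current block list of the file

  ToyINodeFile(boolean striped, long... blocks) {
    this.striped = striped;
    this.blocks = blocks;
  }

  boolean isStriped() {
    return striped;
  }

  /** Same contract as Guava's Preconditions.checkState(boolean). */
  static void checkState(boolean expression) {
    if (!expression) {
      throw new IllegalStateException();
    }
  }

  /**
   * Same shape as the guarded method. Everything after the guard is a
   * hypothetical simplification: it collects the blocks that lie past the
   * snapshot's block count.
   */
  List<Long> collectBlocksBeyondSnapshot(long[] snapshotBlocks) {
    checkState(!isStriped());  // fails for every striped file
    List<Long> collected = new ArrayList<>();
    if (snapshotBlocks == null || blocks == null) {
      return collected;
    }
    for (int i = snapshotBlocks.length; i < blocks.length; i++) {
      collected.add(blocks[i]);
    }
    return collected;
  }
}
```

Because the guard runs before the null check, a striped file fails here regardless of what the snapshot block list contains.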
However, there is a special case in which EC block deletion under a snapshot 
does reach this code. The issue can be reproduced with the following steps:

1) Create an EC folder and run the cp command to copy one large file into 
this folder. HDFS snapshots are also enabled on this EC folder.
2) While the EC data is being written, create a new snapshot on this folder.
3) Kill the running cp process started in step 1.
4) Delete the broken EC file left behind by the copy in the step above. The 
Standby NN will fail with the above error.
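
The steps above can be sketched as a dry-run command plan. This is only an illustration under stated assumptions: `ReproPlan`, all paths, the snapshot name and the policy name `RS-6-3-1024k` are example values not taken from this report, and the program prints the commands rather than running them. The report does not say which path the interrupted copy leaves behind or whether trash is involved in the delete, so the broken file path is a caller-supplied parameter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ReproPlan {
  /** One step of the plan; the command is empty for a manual action. */
  static final class Step {
    final String note;
    final List<String> command;

    Step(String note, String... command) {
      this.note = note;
      this.command = Collections.unmodifiableList(Arrays.asList(command));
    }
  }

  /** Builds the ordered steps; every argument is an example value. */
  static List<Step> buildSteps(String ecDir, String policy, String source,
                               String snapshotName, String brokenFile) {
    List<Step> steps = new ArrayList<>();
    steps.add(new Step("create the folder",
        "hdfs", "dfs", "-mkdir", "-p", ecDir));
    steps.add(new Step("make it an EC folder",
        "hdfs", "ec", "-setPolicy", "-path", ecDir, "-policy", policy));
    steps.add(new Step("enable snapshots on it",
        "hdfs", "dfsadmin", "-allowSnapshot", ecDir));
    steps.add(new Step("step 1: start copying a large file (keep it running)",
        "hdfs", "dfs", "-cp", source, ecDir));
    steps.add(new Step("step 2: snapshot while the copy is still writing",
        "hdfs", "dfs", "-createSnapshot", ecDir, snapshotName));
    steps.add(new Step("step 3: kill the running cp process (manual action)"));
    steps.add(new Step("step 4: delete the broken file left by the copy",
        "hdfs", "dfs", "-rm", brokenFile));
    return steps;
  }

  public static void main(String[] args) {
    List<Step> steps = buildSteps("/ec-dir", "RS-6-3-1024k",
        "/data/large-file", "s1", "/ec-dir/large-file");
    for (Step step : steps) {
      System.out.println("# " + step.note);
      if (!step.command.isEmpty()) {
        System.out.println(String.join(" ", step.command));
      }
    }
  }
}
```

The timing between the copy, the snapshot and the kill is what matters, so in practice the copy has to be large enough to still be in progress when the snapshot is created.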

  was:
We hit a corner case in which deleting EC (erasure-coded) blocks under an HDFS 
snapshot can crash the NameNode.

The stack trace:
{noformat}
2024-07-10 23:17:47,665 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation DeleteOp [length=0, path=xxxx, timestamp=1720678635100, RpcClientId=5161c587-9102-41cf-b823-fe618db9ab4c, RpcCallId=177, opCode=OP_DELETE, txid=55577688248]
java.lang.IllegalStateException
        at org.apache.hadoop.thirdparty.com.google.common.base.Preconditions.checkState(Preconditions.java:494)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.collectBlocksBeyondSnapshot(INodeFile.java:1225)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.FileWithSnapshotFeature.collectBlocksAndClear(FileWithSnapshotFeature.java:240)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.FileWithSnapshotFeature.cleanFile(FileWithSnapshotFeature.java:134)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.cleanSubtree(INodeFile.java:754)
        at org.apache.hadoop.hdfs.server.namenode.INodeReference$DstReference.destroyAndCollectBlocks(INodeReference.java:714)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.destroyCreatedList(DirectoryWithSnapshotFeature.java:75)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.access$800(DirectoryWithSnapshotFeature.java:48)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.destroyDstSubtree(DirectoryWithSnapshotFeature.java:423)
        at org.apache.hadoop.hdfs.server.namenode.INodeReference$DstReference.destroyAndCollectBlocks(INodeReference.java:720)
        at org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.unprotectedDelete(FSDirDeleteOp.java:258)
        at org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.deleteForEditLog(FSDirDeleteOp.java:143)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:630)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:288)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:183)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:915)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:364)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:505)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$400(EditLogTailer.java:451)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:468)
{noformat}
The cause is that the code assumes EC block deletion never reaches the truncate 
logic shown below, because EC files do not support truncate.

*INodeFile#collectBlocksBeyondSnapshot*
{noformat}
/**
 * This function is only called when block list is stored in snapshot
 * diffs. Note that this can only happen when truncation happens with
 * snapshots. Since we do not support truncation with striped blocks,
 * we only need to handle contiguous blocks here.
 */
public void collectBlocksBeyondSnapshot(BlockInfo[] snapshotBlocks,
                                        BlocksMapUpdateInfo collectedBlocks) {
  Preconditions.checkState(!isStriped());   // <=== exception is thrown here
  BlockInfo[] oldBlocks = getBlocks();
  if(snapshotBlocks == null || oldBlocks == null)
    return;
  ...
  }
}
{noformat}
However, there is a special case in which EC block deletion under a snapshot 
does reach this code. The issue can be reproduced with the following steps:

1) Create an EC folder and run the cp command to copy one large file into 
this folder. HDFS snapshots are also enabled on this EC folder.
2) While the EC data is being written, create a new snapshot.
3) Kill the running cp process started in step 1.
4) Delete the broken EC file left behind by the copy in the step above. The 
Standby NN will fail with the above error.


> EC block deletion under snapshot makes NameNode crashed
> -------------------------------------------------------
>
>                 Key: HDFS-17604
>                 URL: https://issues.apache.org/jira/browse/HDFS-17604
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec, erasure-coding
>    Affects Versions: 3.3.3
>            Reporter: Yiqun Lin
>            Priority: Major


