Srinivasu Majeti created HDFS-17399:
---------------------------------------
             Summary: Ensure atomic transactions when snapshot manager is facing OS resource limit issues
                 Key: HDFS-17399
                 URL: https://issues.apache.org/jira/browse/HDFS-17399
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: snapshots
    Affects Versions: 3.1.1
            Reporter: Srinivasu Majeti

One of our customers is hitting OS resource limits (maximum number of processes) on at least one of the NameNodes. As a result, snapshot creation failed on host02 on the 14th:

{code:java}
2023-05-14 10:41:28,233 WARN org.apache.hadoop.ipc.Server: IPC Server handler 22 on 8020, call Call#11 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.createSnapshot from xx.xxx.xx.xxx:59442
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
        at java.base/java.lang.Thread.start0(Native Method)
        at java.base/java.lang.Thread.start(Thread.java:803)
        at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
        at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
        at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:140)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager.getINodeWithLeases(LeaseManager.java:246)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.addSnapshot(DirectorySnapshottableFeature.java:211)
        at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.addSnapshot(INodeDirectory.java:288)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.createSnapshot(SnapshotManager.java:463)
        at org.apache.hadoop.hdfs.server.namenode.FSDirSnapshotOp.createSnapshot(FSDirSnapshotOp.java:110)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.createSnapshot(FSNamesystem.java:6767)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.createSnapshot(NameNodeRpcServer.java:1871)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.createSnapshot(ClientNamenodeProtocolServerSideTranslatorPB.java:1273)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNameno
{code}

host02 log (NN log):
{code:java}
2023-05-14 10:42:49,983 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'http://host03.amd.com:8480/getJournal?jid=cdp01ha&segmentTxId=1623400203&storageInfo=-64%3A1444325792%3A1600117814333%3Acluster1546333019&inProgressOk=true, http://host02.domain.com:8480/getJournal?jid=cdp01ha&segmentTxId=1623400203&storageInfo=-64%3A1444325792%3A1600117814333%3Acluster1546333019&inProgressOk=true' to transaction ID 1623400203
2023-05-14 10:42:49,983 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'http://host01.domain.com:8480/getJournal?jid=cdp01ha&segmentTxId=1623400203&storageInfo=-64%3A1444325792%3A1600117814333%3Acluster1546333019&inProgressOk=true' to transaction ID 1623400203
2023-05-14 10:42:50,011 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation DeleteSnapshotOp [snapshotRoot=/user/user1, snapshotName=distcp-1546382661--205240459-new, RpcClientId=31353569-0e2e-4272-9acf-a6b71f51242c, RpcCallId=18]
org.apache.hadoop.hdfs.protocol.SnapshotException: Cannot delete snapshot distcp-1546382661--205240459-new from path /user/user1: the snapshot does not exist.
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.removeSnapshot(DirectorySnapshottableFeature.java:260)
        at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.removeSnapshot(INodeDirectory.java:296)
{code}

We then identified the wrong records in the edit log and fixed them manually:
{code:java}
The edit log segment causing the problem is "edits_0000000001623400203-0000000001623402627", which contains 38626 lines when converted to XML format.
On further investigation, we discovered 602 transactions attempting to delete the snapshot "distcp-1546382661--205240459-new", which does not exist:

<RECORD>
  <OPCODE>OP_DELETE_SNAPSHOT</OPCODE>
  <DATA>
    <TXID>1623401061</TXID>
    <SNAPSHOTROOT>/user/user1</SNAPSHOTROOT>
    <SNAPSHOTNAME>distcp-1546382661--205240459-new</SNAPSHOTNAME>
    <RPC_CLIENTID>31353569-0e2e-4272-9acf-a6b71f51242c</RPC_CLIENTID>
    <RPC_CALLID>1864</RPC_CALLID>
  </DATA>
</RECORD>

Each such transaction consists of the 10 lines above, so a total of 602 * 10 = 6020 lines had to be removed from the original 38626. The number of lines after correction is 38626 - 6020 = 32606.
{code}

Raising this ticket to discuss how to address this corner case instead of manually correcting edit logs; for example, Hadoop is missing a defensive mechanism that would either keep these inconsistent transactions out of the edit log or tolerate them on replay.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
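The manual correction described above (stripping the 602 bad OP_DELETE_SNAPSHOT records from the XML form of the segment) can be sketched as a small program. This is only an illustrative sketch, not the tool the reporter used; in practice the segment would be converted with `hdfs oev -p xml`, filtered, and converted back with `hdfs oev -p binary`. The `EditLogFilter` class name is made up for this example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.StringReader;
import java.io.StringWriter;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class EditLogFilter {

    // Remove every <RECORD> whose OPCODE is OP_DELETE_SNAPSHOT and whose
    // SNAPSHOTNAME matches the snapshot that never existed.
    public static String filter(String xml, String badSnapshot) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList records = doc.getElementsByTagName("RECORD");
        // Iterate backwards: the NodeList is live, so removing a node
        // while iterating forwards would shift the remaining indices.
        for (int i = records.getLength() - 1; i >= 0; i--) {
            Element rec = (Element) records.item(i);
            String op = text(rec, "OPCODE");
            String name = text(rec, "SNAPSHOTNAME");
            if ("OP_DELETE_SNAPSHOT".equals(op) && badSnapshot.equals(name)) {
                rec.getParentNode().removeChild(rec);
            }
        }
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    private static String text(Element rec, String tag) {
        NodeList n = rec.getElementsByTagName(tag);
        return n.getLength() == 0 ? null : n.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in for the oev XML output; the real segment had 38626 lines.
        String xml = "<EDITS>"
                + "<RECORD><OPCODE>OP_MKDIR</OPCODE><DATA><TXID>1</TXID></DATA></RECORD>"
                + "<RECORD><OPCODE>OP_DELETE_SNAPSHOT</OPCODE><DATA><TXID>2</TXID>"
                + "<SNAPSHOTNAME>distcp-1546382661--205240459-new</SNAPSHOTNAME></DATA></RECORD>"
                + "</EDITS>";
        String cleaned = filter(xml, "distcp-1546382661--205240459-new");
        System.out.println(cleaned.contains("OP_DELETE_SNAPSHOT") ? "BAD" : "CLEANED");
    }
}
```

This removes only records matching both the opcode and the offending snapshot name, so legitimate snapshot deletions in the same segment are left untouched.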
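One shape the suggested defensive mechanism could take is tolerating, rather than aborting on, a DeleteSnapshotOp whose snapshot does not exist during edit-log replay. The toy model below is a sketch under stated assumptions, not Hadoop code: the class, the string opcodes, and the skip-and-warn behavior are all simplified stand-ins, and skip-and-warn versus fail-fast is a genuine design trade-off, since silently skipping edits can mask other corruption.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of an edit-log replayer (hypothetical class, not Hadoop's
// FSEditLogLoader): a delete of a missing snapshot is counted and skipped
// instead of throwing, so one inconsistent edit cannot keep the NameNode down.
public class TolerantReplayer {
    final Set<String> snapshots = new HashSet<>();
    int skipped = 0;

    void apply(String op, String snapshotName) {
        switch (op) {
            case "OP_CREATE_SNAPSHOT":
                snapshots.add(snapshotName);
                break;
            case "OP_DELETE_SNAPSHOT":
                if (!snapshots.remove(snapshotName)) {
                    // Defensive path: warn and continue rather than abort replay.
                    skipped++;
                    System.err.println("Skipping delete of missing snapshot: " + snapshotName);
                }
                break;
            default:
                throw new IllegalArgumentException("unknown op " + op);
        }
    }

    public static void main(String[] args) {
        TolerantReplayer r = new TolerantReplayer();
        r.apply("OP_CREATE_SNAPSHOT", "s1");
        r.apply("OP_DELETE_SNAPSHOT", "s1");            // normal case: applied
        r.apply("OP_DELETE_SNAPSHOT", "missing-snap");  // tolerated and counted
        System.out.println("skipped=" + r.skipped);     // prints skipped=1
    }
}
```

The complementary fix is on the write side: ensuring the in-memory mutation and the logged edit succeed or fail together, so a failure like the OutOfMemoryError above cannot leave the edit log disagreeing with the namespace in the first place.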