Srinivasu Majeti created HDFS-17399:
---------------------------------------

             Summary: Ensure atomic transactions when snapshot manager is 
facing OS resource limit issues
                 Key: HDFS-17399
                 URL: https://issues.apache.org/jira/browse/HDFS-17399
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: snapshots
    Affects Versions: 3.1.1
            Reporter: Srinivasu Majeti


One of our customers is hitting OS resource limits (maximum number of processes) on at
least one of the NameNodes.
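
A quick note on the failure mode before the traces: the "unable to create native thread" 
OutOfMemoryError in the first trace below is the usual JVM symptom of exhausting the OS 
per-user process/thread limit (typically ulimit -u on Linux). A minimal, hypothetical 
standalone repro (not Hadoop code; it only shows how a thread-pool submission surfaces this 
error, and it deliberately exhausts the limit, so run it only in a disposable environment) 
would be:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class NativeThreadLimitRepro {
    public static void main(String[] args) {
        // A cached pool adds a new worker thread for every task when no idle
        // worker is available; since these tasks never finish, every submit()
        // eventually reaches Thread.start0() for a brand-new native thread.
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            while (true) {
                pool.submit(() -> {
                    try {
                        Thread.sleep(Long.MAX_VALUE); // park the worker forever
                    } catch (InterruptedException ignored) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        } catch (OutOfMemoryError e) {
            // Same error text as in the NameNode trace below:
            // "unable to create native thread: possibly out of memory or
            //  process/resource limits reached"
            System.err.println("Hit native thread limit: " + e.getMessage());
        } finally {
            pool.shutdownNow();
        }
    }
}
{code}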

{code:java}
host02: > As a result, Snapshot creation failed on 14th:

2023-05-14 10:41:28,233 WARN org.apache.hadoop.ipc.Server: IPC Server handler 22 on 8020, call Call#11 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.createSnapshot from xx.xxx.xx.xxx:59442
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
        at java.base/java.lang.Thread.start0(Native Method)
        at java.base/java.lang.Thread.start(Thread.java:803)
        at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
        at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
        at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:140)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager.getINodeWithLeases(LeaseManager.java:246)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.addSnapshot(DirectorySnapshottableFeature.java:211)
        at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.addSnapshot(INodeDirectory.java:288)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.createSnapshot(SnapshotManager.java:463)
        at org.apache.hadoop.hdfs.server.namenode.FSDirSnapshotOp.createSnapshot(FSDirSnapshotOp.java:110)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.createSnapshot(FSNamesystem.java:6767)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.createSnapshot(NameNodeRpcServer.java:1871)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.createSnapshot(ClientNamenodeProtocolServerSideTranslatorPB.java:1273)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNameno
{code}

{code:java}
host02 log (NN log)

2023-05-14 10:42:49,983 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'http://host03.amd.com:8480/getJournal?jid=cdp01ha&segmentTxId=1623400203&storageInfo=-64%3A1444325792%3A1600117814333%3Acluster1546333019&inProgressOk=true, http://host02.domain.com:8480/getJournal?jid=cdp01ha&segmentTxId=1623400203&storageInfo=-64%3A1444325792%3A1600117814333%3Acluster1546333019&inProgressOk=true' to transaction ID 1623400203
2023-05-14 10:42:49,983 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'http://host01.domain.com:8480/getJournal?jid=cdp01ha&segmentTxId=1623400203&storageInfo=-64%3A1444325792%3A1600117814333%3Acluster1546333019&inProgressOk=true' to transaction ID 1623400203
2023-05-14 10:42:50,011 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation DeleteSnapshotOp [snapshotRoot=/user/user1, snapshotName=distcp-1546382661--205240459-new, RpcClientId=31353569-0e2e-4272-9acf-a6b71f51242c, RpcCallId=18]
org.apache.hadoop.hdfs.protocol.SnapshotException: Cannot delete snapshot distcp-1546382661--205240459-new from path /user/user1: the snapshot does not exist.
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.removeSnapshot(DirectorySnapshottableFeature.java:260)
        at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.removeSnapshot(INodeDirectory.java:296)
{code}
We then identified the incorrect records in the edit log and fixed them manually:
{code:java}
The edit segment causing the problem is "edits_0000000001623400203-0000000001623402627",
which contains 38626 lines when converted to XML format. On further investigation we
discovered 602 transactions attempting to delete a snapshot
"distcp-1546382661--205240459-new" that does not exist:

<RECORD>
  <OPCODE>OP_DELETE_SNAPSHOT</OPCODE>
  <DATA>
    <TXID>1623401061</TXID>
    <SNAPSHOTROOT>/user/user1</SNAPSHOTROOT>
    <SNAPSHOTNAME>distcp-1546382661--205240459-new</SNAPSHOTNAME>
    <RPC_CLIENTID>31353569-0e2e-4272-9acf-a6b71f51242c</RPC_CLIENTID>
    <RPC_CALLID>1864</RPC_CALLID>
  </DATA>
</RECORD>

Each such transaction consists of the above 10 lines, i.e. 602 * 10 = 6020 lines that
need to be removed from the original 38626 lines. The number of lines after correction
is 38626 - 6020 = 32606.
{code}
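
For reference, a small hypothetical helper along these lines (not part of HDFS; the
phantom snapshot name is taken from the notes above, and the RECORD/OPCODE/SNAPSHOTNAME
tag names are an assumption about the XML layout produced by the offline edits viewer)
can cross-check the number of offending records before touching the segment by hand:

{code:java}
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class CountPhantomDeletes {
    public static void main(String[] args) throws Exception {
        File editsXml = new File(args[0]);  // edits segment already dumped to XML
        String phantom = "distcp-1546382661--205240459-new";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(editsXml);

        // Count RECORD elements that are OP_DELETE_SNAPSHOT for the phantom name.
        NodeList records = doc.getElementsByTagName("RECORD");
        int matches = 0;
        for (int i = 0; i < records.getLength(); i++) {
            Element record = (Element) records.item(i);
            String opcode = text(record, "OPCODE");
            String snapshotName = text(record, "SNAPSHOTNAME");
            if ("OP_DELETE_SNAPSHOT".equals(opcode) && phantom.equals(snapshotName)) {
                matches++;
            }
        }
        System.out.println("OP_DELETE_SNAPSHOT records for phantom snapshot: " + matches);
    }

    /** First text value of the named child tag under this record, or null. */
    private static String text(Element record, String tag) {
        NodeList nodes = record.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }
}
{code}

If it prints 602, the 602 * 10 = 6020 line arithmetic above holds.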
Raising this ticket to discuss how to address this corner case without manually correcting
edit logs; for example, Hadoop could provide a defensive mechanism for this situation,
which is missing today.
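
To make the ask concrete, here is a minimal, self-contained sketch (hypothetical code,
not the HDFS implementation; all class, method, and field names are made up for
illustration) of the all-or-nothing behaviour being requested: if any step of snapshot
creation fails, such as the OutOfMemoryError above, the partial in-memory change is
rolled back and nothing is journaled, so the in-memory snapshot list and the edit log
can never disagree, and replay can never hit a delete for a snapshot that was never
durably created.

{code:java}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AtomicSnapshotSketch {

    private final Set<String> snapshots = new HashSet<>();   // in-memory snapshot state
    private final List<String> journal = new ArrayList<>();  // stand-in for the edit log

    public synchronized void createSnapshot(String path, String name, Runnable captureOpenFiles) {
        String key = path + "@" + name;
        if (!snapshots.add(key)) {
            throw new IllegalStateException("Snapshot already exists: " + key);
        }
        boolean applied = false;
        try {
            // The step that failed in the field with
            // "OutOfMemoryError: unable to create native thread".
            captureOpenFiles.run();
            applied = true;
        } finally {
            if (!applied) {
                snapshots.remove(key); // roll back the partial in-memory change
            }
        }
        // Journal only after the whole in-memory operation succeeded.
        journal.add("OP_CREATE_SNAPSHOT " + key);
    }

    public synchronized void deleteSnapshot(String path, String name) {
        String key = path + "@" + name;
        // Because creation is all-or-nothing, this check always agrees with what
        // the edit log contains, so replay cannot encounter
        // "Cannot delete snapshot ...: the snapshot does not exist".
        if (!snapshots.remove(key)) {
            throw new IllegalStateException("Snapshot does not exist: " + key);
        }
        journal.add("OP_DELETE_SNAPSHOT " + key);
    }

    public static void main(String[] args) {
        AtomicSnapshotSketch nn = new AtomicSnapshotSketch();
        try {
            nn.createSnapshot("/user/user1", "s1",
                () -> { throw new OutOfMemoryError("unable to create native thread"); });
        } catch (OutOfMemoryError expected) {
            // createSnapshot failed cleanly: no in-memory snapshot, no journal entry.
        }
        System.out.println("snapshots=" + nn.snapshots + " journal=" + nn.journal);
    }
}
{code}

The real fix may of course look quite different inside SnapshotManager/FSEditLog; the
sketch only pins down the invariant being asked for.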



