Srinivasu Majeti created HDFS-17399:
---------------------------------------
Summary: Ensure atomic transactions when snapshot manager is
facing OS resource limit issues
Key: HDFS-17399
URL: https://issues.apache.org/jira/browse/HDFS-17399
Project: Hadoop HDFS
Issue Type: Bug
Components: snapshots
Affects Versions: 3.1.1
Reporter: Srinivasu Majeti
One of our customers is hitting OS resource limits (maximum number of
processes) on at least one of the NameNodes.
{code:java}
host02: As a result, Snapshot creation failed on 14th:
2023-05-14 10:41:28,233 WARN org.apache.hadoop.ipc.Server: IPC Server handler 22 on 8020, call Call#11 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.createSnapshot from xx.xxx.xx.xxx:59442
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
    at java.base/java.lang.Thread.start0(Native Method)
    at java.base/java.lang.Thread.start(Thread.java:803)
    at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
    at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
    at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:140)
    at org.apache.hadoop.hdfs.server.namenode.LeaseManager.getINodeWithLeases(LeaseManager.java:246)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.addSnapshot(DirectorySnapshottableFeature.java:211)
    at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.addSnapshot(INodeDirectory.java:288)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.createSnapshot(SnapshotManager.java:463)
    at org.apache.hadoop.hdfs.server.namenode.FSDirSnapshotOp.createSnapshot(FSDirSnapshotOp.java:110)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.createSnapshot(FSNamesystem.java:6767)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.createSnapshot(NameNodeRpcServer.java:1871)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.createSnapshot(ClientNamenodeProtocolServerSideTranslatorPB.java:1273)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNameno
{code}
{code:java}
host02 log (NN log)
2023-05-14 10:42:49,983 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'http://host03.amd.com:8480/getJournal?jid=cdp01ha&segmentTxId=1623400203&storageInfo=-64%3A1444325792%3A1600117814333%3Acluster1546333019&inProgressOk=true, http://host02.domain.com:8480/getJournal?jid=cdp01ha&segmentTxId=1623400203&storageInfo=-64%3A1444325792%3A1600117814333%3Acluster1546333019&inProgressOk=true' to transaction ID 1623400203
2023-05-14 10:42:49,983 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'http://host01.domain.com:8480/getJournal?jid=cdp01ha&segmentTxId=1623400203&storageInfo=-64%3A1444325792%3A1600117814333%3Acluster1546333019&inProgressOk=true' to transaction ID 1623400203
2023-05-14 10:42:50,011 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation DeleteSnapshotOp [snapshotRoot=/user/user1, snapshotName=distcp-1546382661--205240459-new, RpcClientId=31353569-0e2e-4272-9acf-a6b71f51242c, RpcCallId=18]
org.apache.hadoop.hdfs.protocol.SnapshotException: Cannot delete snapshot distcp-1546382661--205240459-new from path /user/user1: the snapshot does not exist.
    at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.removeSnapshot(DirectorySnapshottableFeature.java:260)
    at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.removeSnapshot(INodeDirectory.java:296)
{code}
We then identified the invalid records in the edit log and removed them manually:
{code:java}
The edit log segment causing the problem is
"edits_0000000001623400203-0000000001623402627", which contains 38626 lines
when converted to XML format. On further investigation, we discovered 602
transactions attempting to delete a snapshot
"distcp-1546382661--205240459-new" that does not exist, for example:

OP_DELETE_SNAPSHOT 1623401061 /user/user1 distcp-1546382661--205240459-new 31353569-0e2e-4272-9acf-a6b71f51242c 1864

Each such transaction occupies 10 lines in the XML, so a total of
602 x 10 = 6020 lines need to be removed from the original 38626 lines.
The number of lines after correction is 38626 - 6020 = 32606.
{code}
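The manual removal described above could be scripted roughly as follows. This is only a sketch: it assumes the segment has already been converted to XML (for example with the Offline Edits Viewer, `hdfs oev -p xml`), and that each record uses the element names `RECORD`, `OPCODE`, `DATA`, `SNAPSHOTROOT` and `SNAPSHOTNAME`, which the ticket does not show. The function name and the sample document are made up for illustration. The sketch mirrors the line removal only; it does not renumber transaction IDs or check that the result can be loaded.

```python
import xml.etree.ElementTree as ET


def drop_delete_snapshot_records(xml_text, snapshot_root, snapshot_name):
    """Remove OP_DELETE_SNAPSHOT records for one snapshot.

    Returns (filtered_xml_text, number_of_records_removed).
    Element names are assumptions about the XML layout.
    """
    root = ET.fromstring(xml_text)
    removed = 0
    # Iterate over a copy because records are removed while looping.
    for record in list(root.findall("RECORD")):
        data = record.find("DATA")
        if (
            record.findtext("OPCODE") == "OP_DELETE_SNAPSHOT"
            and data is not None
            and data.findtext("SNAPSHOTROOT") == snapshot_root
            and data.findtext("SNAPSHOTNAME") == snapshot_name
        ):
            root.remove(record)
            removed += 1
    return ET.tostring(root, encoding="unicode"), removed


# Made-up two-record sample: the first record matches the ticket's values,
# the second ("other-snapshot") is hypothetical and must be kept.
SAMPLE = """<EDITS>
  <EDITS_VERSION>-64</EDITS_VERSION>
  <RECORD>
    <OPCODE>OP_DELETE_SNAPSHOT</OPCODE>
    <DATA>
      <TXID>1623401061</TXID>
      <SNAPSHOTROOT>/user/user1</SNAPSHOTROOT>
      <SNAPSHOTNAME>distcp-1546382661--205240459-new</SNAPSHOTNAME>
      <RPC_CLIENTID>31353569-0e2e-4272-9acf-a6b71f51242c</RPC_CLIENTID>
      <RPC_CALLID>1864</RPC_CALLID>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_DELETE_SNAPSHOT</OPCODE>
    <DATA>
      <TXID>1623401062</TXID>
      <SNAPSHOTROOT>/user/user1</SNAPSHOTROOT>
      <SNAPSHOTNAME>other-snapshot</SNAPSHOTNAME>
    </DATA>
  </RECORD>
</EDITS>"""

filtered, removed_count = drop_delete_snapshot_records(
    SAMPLE, "/user/user1", "distcp-1546382661--205240459-new"
)
```

For a real segment the text would be read from the converted XML file and the filtered output written to a new file, keeping the original as a backup.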
I am raising this ticket to discuss how to handle this corner case without
manually correcting edit logs. For example, Hadoop should have a defensive
mechanism for this situation, but it is currently missing.
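One possible shape for such a defensive mechanism is sketched below as a toy model. It is not Hadoop code, and the ticket does not state the exact failure sequence. The sketch assumes that an in-memory change made before the failing step was left in place while no create operation was logged. The class, its methods and the `capture_open_files` callback are all hypothetical. The idea is to undo the in-memory change whenever a later step fails, so that in-memory state and the edit log cannot diverge.

```python
class SnapshotDirectory:
    """Toy stand-in for a snapshottable directory (hypothetical, not Hadoop)."""

    def __init__(self):
        self.snapshots = []   # in-memory snapshot names
        self.edit_log = []    # operations that would be persisted

    def create_snapshot(self, name, capture_open_files):
        if name in self.snapshots:
            raise ValueError("snapshot already exists: " + name)
        self.snapshots.append(name)  # in-memory change happens first
        try:
            # Step that may fail under OS resource limits,
            # e.g. when a new thread cannot be started.
            capture_open_files()
        except BaseException:
            # Roll back so memory matches the (unchanged) edit log.
            # BaseException is used so that even severe errors trigger the undo.
            self.snapshots.remove(name)
            raise
        # Log the operation only after every step has succeeded.
        self.edit_log.append(("OP_CREATE_SNAPSHOT", name))

    def delete_snapshot(self, name):
        if name not in self.snapshots:
            raise KeyError("snapshot does not exist: " + name)
        self.snapshots.remove(name)
        self.edit_log.append(("OP_DELETE_SNAPSHOT", name))


def failing_capture():
    # Simulates the resource failure from the ticket's stack trace.
    raise MemoryError("unable to create native thread")


directory = SnapshotDirectory()
try:
    directory.create_snapshot("s1", failing_capture)
except MemoryError:
    pass  # creation failed; the rollback has already run
```

With the rollback in place, a later `delete_snapshot("s1")` is rejected before anything is logged, so no delete operation for a nonexistent snapshot can reach the edit log in this model.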