[
https://issues.apache.org/jira/browse/HDFS-17399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Srinivasu Majeti updated HDFS-17399:
------------------------------------
Description:
One of our customers is hitting OS resource limits (maximum number of processes) on at least one of the NameNodes.
{code:java}
host02: As a result, snapshot creation failed on the 14th:

2023-05-14 10:41:28,233 WARN org.apache.hadoop.ipc.Server: IPC Server handler 22 on 8020, call Call#11 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.createSnapshot from xx.xxx.xx.xxx:59442
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
    at java.base/java.lang.Thread.start0(Native Method)
    at java.base/java.lang.Thread.start(Thread.java:803)
    at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
    at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
    at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:140)
    at org.apache.hadoop.hdfs.server.namenode.LeaseManager.getINodeWithLeases(LeaseManager.java:246)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.addSnapshot(DirectorySnapshottableFeature.java:211)
    at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.addSnapshot(INodeDirectory.java:288)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.createSnapshot(SnapshotManager.java:463)
    at org.apache.hadoop.hdfs.server.namenode.FSDirSnapshotOp.createSnapshot(FSDirSnapshotOp.java:110)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.createSnapshot(FSNamesystem.java:6767)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.createSnapshot(NameNodeRpcServer.java:1871)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.createSnapshot(ClientNamenodeProtocolServerSideTranslatorPB.java:1273)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNameno
{code}
{code:java}
host02 log (NN log):

2023-05-14 10:42:49,983 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'http://host03.amd.com:8480/getJournal?jid=cdp01ha&segmentTxId=1623400203&storageInfo=-64%3A1444325792%3A1600117814333%3Acluster1546333019&inProgressOk=true, http://host02.domain.com:8480/getJournal?jid=cdp01ha&segmentTxId=1623400203&storageInfo=-64%3A1444325792%3A1600117814333%3Acluster1546333019&inProgressOk=true' to transaction ID 1623400203
2023-05-14 10:42:49,983 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'http://host01.domain.com:8480/getJournal?jid=cdp01ha&segmentTxId=1623400203&storageInfo=-64%3A1444325792%3A1600117814333%3Acluster1546333019&inProgressOk=true' to transaction ID 1623400203
2023-05-14 10:42:50,011 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation DeleteSnapshotOp [snapshotRoot=/user/user1, snapshotName=distcp-1546382661--205240459-new, RpcClientId=31353569-0e2e-4272-9acf-a6b71f51242c, RpcCallId=18]
org.apache.hadoop.hdfs.protocol.SnapshotException: Cannot delete snapshot distcp-1546382661--205240459-new from path /user/user1: the snapshot does not exist.
    at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.removeSnapshot(DirectorySnapshottableFeature.java:260)
    at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.removeSnapshot(INodeDirectory.java:296)
{code}
We then identified the bad records in the edit log and corrected them manually:
{code:java}
The edit log segment causing the problem is "edits_0000000001623400203-0000000001623402627", which contains 38626 lines when converted to XML format. On further investigation, we discovered 602 transactions attempting to delete the snapshot "distcp-1546382661--205240459-new", which does not exist:

OP_DELETE_SNAPSHOT
1623401061
/user/user1
distcp-1546382661--205240459-new
31353569-0e2e-4272-9acf-a6b71f51242c
1864

Each transaction consists of about 10 lines like the above, so 6020 lines in total had to be removed from the original 38626. The number of lines after correction is 38626 - 6020 = 32606.
{code}
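The manual correction above can be scripted. HDFS ships the Offline Edits Viewer (`hdfs oev`) for converting an edit log segment to XML and back; a minimal sketch of the filtering step is below. The file names are hypothetical, and the XML layout (EDITS/RECORD with OPCODE and DATA/SNAPSHOTNAME children) follows `hdfs oev` output but should be verified against your Hadoop version before use.

```python
# Sketch: drop bogus OP_DELETE_SNAPSHOT records from an edit log segment
# converted to XML with `hdfs oev -i edits_... -o edits.xml`.
import xml.etree.ElementTree as ET

BAD_SNAPSHOT = "distcp-1546382661--205240459-new"

def drop_bad_deletes(root: ET.Element, snapshot_name: str = BAD_SNAPSHOT) -> int:
    """Remove every OP_DELETE_SNAPSHOT record naming the missing snapshot.

    `root` is the <EDITS> element; returns the number of records removed.
    """
    removed = 0
    # Copy the list first: we mutate root while iterating over its records.
    for record in list(root.findall("RECORD")):
        opcode = record.findtext("OPCODE")
        name = record.findtext("DATA/SNAPSHOTNAME")
        if opcode == "OP_DELETE_SNAPSHOT" and name == snapshot_name:
            root.remove(record)
            removed += 1
    return removed

# Usage (file names hypothetical):
#   tree = ET.parse("edits.xml")
#   print("removed", drop_bad_deletes(tree.getroot()), "records")
#   tree.write("edits-fixed.xml")
# Convert back to binary with:
#   hdfs oev -p binary -i edits-fixed.xml -o edits_fixed
```

In this incident the script should report 602 removed records, matching the count found by inspection.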
Raising this ticket to discuss how to address this corner case without manually correcting edit logs; for example, there should be a defensive mechanism in Hadoop, but one is currently missing: a snapshot operation that fails partway through under resource pressure should not leave partially applied (non-atomic) transactions in the edit log.
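One possible shape for such a defensive mechanism is a "validate and apply first, journal last" ordering, so that a delete that fails (e.g. because the snapshot does not exist, or a thread cannot be spawned) never emits an edit-log record. The sketch below is a toy illustration of that invariant only; the class and method names are hypothetical and are not actual HDFS internals.

```python
# Toy sketch: the transaction is journaled only after the in-memory
# change has fully succeeded, so replay can never see a delete for a
# snapshot that was never actually removed. All names are hypothetical.
class SnapshotError(Exception):
    pass

class ToySnapshotManager:
    def __init__(self):
        self.snapshots = {}   # path -> set of snapshot names
        self.edit_log = []    # journaled transactions, in order

    def delete_snapshot(self, path: str, name: str) -> None:
        names = self.snapshots.get(path, set())
        if name not in names:
            # Fail *before* journaling: no OP_DELETE_SNAPSHOT record is
            # ever written for a snapshot that does not exist.
            raise SnapshotError(
                f"Cannot delete snapshot {name} from path {path}: "
                "the snapshot does not exist.")
        names.remove(name)
        # Journal only after the in-memory change succeeded.
        self.edit_log.append(("OP_DELETE_SNAPSHOT", path, name))
```

With this ordering, an OutOfMemoryError thrown anywhere before the journal append leaves the edit log untouched, and replay on the standby stays consistent with the active NameNode's state.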
> Ensure atomic transactions when snapshot manager is facing OS resource limit
> issues
> -----------------------------------------------------------------------------------
>
> Key: HDFS-17399
> URL: https://issues.apache.org/jira/browse/HDFS-17399
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: snapshots
> Affects Versions: 3.1.1
> Reporter: Srinivasu Majeti
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)