[ 
https://issues.apache.org/jira/browse/HDFS-17399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinivasu Majeti updated HDFS-17399:
------------------------------------
    Description: 
One of our customers is hitting OS resource limits (maximum number of processes) on 
at least one of the NameNodes.

{code:java}
host02: > As a result, Snapshot creation failed on 14th:

2023-05-14 10:41:28,233 WARN org.apache.hadoop.ipc.Server: IPC Server handler 22 on 8020, call Call#11 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.createSnapshot from xx.xxx.xx.xxx:59442
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
    at java.base/java.lang.Thread.start0(Native Method)
    at java.base/java.lang.Thread.start(Thread.java:803)
    at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
    at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
    at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:140)
    at org.apache.hadoop.hdfs.server.namenode.LeaseManager.getINodeWithLeases(LeaseManager.java:246)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.addSnapshot(DirectorySnapshottableFeature.java:211)
    at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.addSnapshot(INodeDirectory.java:288)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.createSnapshot(SnapshotManager.java:463)
    at org.apache.hadoop.hdfs.server.namenode.FSDirSnapshotOp.createSnapshot(FSDirSnapshotOp.java:110)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.createSnapshot(FSNamesystem.java:6767)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.createSnapshot(NameNodeRpcServer.java:1871)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.createSnapshot(ClientNamenodeProtocolServerSideTranslatorPB.java:1273)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNameno
{code}
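The "unable to create native thread" error above usually means the NameNode process hit its per-user process/thread limit (nproc) rather than running out of heap. A quick way to check the effective limits on the NameNode host (the process pattern and paths are illustrative, adjust for your deployment):

```shell
# Max user processes/threads for the current shell session.
ulimit -u

# For a running NameNode, check the effective limits of its PID
# (skipped silently if no NameNode is running on this host).
NN_PID=$(pgrep -f 'org.apache.hadoop.hdfs.server.namenode.NameNode' 2>/dev/null | head -1)
if [ -n "$NN_PID" ]; then
  grep -iE 'max (processes|open files)' "/proc/$NN_PID/limits"
fi
```

If the "Max processes" value is low, raising it in `/etc/security/limits.conf` (or the service unit) for the hdfs user avoids this class of failure, though it does not address the edit-log corruption discussed below.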

{code:java}
host02 log (NN log)

2023-05-14 10:42:49,983 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'http://host03.amd.com:8480/getJournal?jid=cdp01ha&segmentTxId=1623400203&storageInfo=-64%3A1444325792%3A1600117814333%3Acluster1546333019&inProgressOk=true, http://host02.domain.com:8480/getJournal?jid=cdp01ha&segmentTxId=1623400203&storageInfo=-64%3A1444325792%3A1600117814333%3Acluster1546333019&inProgressOk=true' to transaction ID 1623400203
2023-05-14 10:42:49,983 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'http://host01.domain.com:8480/getJournal?jid=cdp01ha&segmentTxId=1623400203&storageInfo=-64%3A1444325792%3A1600117814333%3Acluster1546333019&inProgressOk=true' to transaction ID 1623400203
2023-05-14 10:42:50,011 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation DeleteSnapshotOp [snapshotRoot=/user/user1, snapshotName=distcp-1546382661--205240459-new, RpcClientId=31353569-0e2e-4272-9acf-a6b71f51242c, RpcCallId=18]
org.apache.hadoop.hdfs.protocol.SnapshotException: Cannot delete snapshot distcp-1546382661--205240459-new from path /user/user1: the snapshot does not exist.
    at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.removeSnapshot(DirectorySnapshottableFeature.java:260)
    at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.removeSnapshot(INodeDirectory.java:296)
{code}
We then identified the bad records in the edit log and fixed them manually:

{code:java}
The segment causing the problem is "edits_0000000001623400203-0000000001623402627";
it contains 38626 lines when converted to XML format. On further investigation,
we discovered 602 transactions attempting to delete a snapshot
"distcp-1546382661--205240459-new" which does not exist:

OP_DELETE_SNAPSHOT
  1623401061
  /user/user1
  distcp-1546382661--205240459-new
  31353569-0e2e-4272-9acf-a6b71f51242c
  1864

Each transaction consists of about 10 lines like the above, so a total of 6020
lines need to be removed from the original 38626. The number of lines after
correction is 38626 - 6020 = 32606.
{code}
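For reference, the manual round trip above can be done with the HDFS Offline Edits Viewer (`hdfs oev`). The segment name is taken from the report; the XML surgery step itself (removing the 602 bogus records) is deployment-specific and is only indicated here, and the output file names are illustrative:

```shell
SEG=edits_0000000001623400203-0000000001623402627

if command -v hdfs >/dev/null; then
  # Convert the binary edits segment to XML for inspection/editing.
  hdfs oev -i "$SEG" -o edits.xml -p xml

  # ... manually remove the 602 bogus OP_DELETE_SNAPSHOT records
  #     (10 lines each) from edits.xml, saving as edits.fixed.xml ...

  # Convert the corrected XML back to the binary edits format.
  hdfs oev -i edits.fixed.xml -o "$SEG" -p binary
fi

# Sanity-check the expected line count after removal:
echo $((38626 - 602 * 10))   # 32606
```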

Raising this ticket to discuss how to address this corner case rather than 
manually correcting edit logs; for example, Hadoop could include a defensive 
mechanism here, which is currently missing.
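Until such a mechanism exists in the NameNode itself, a client-side workaround (not the in-NameNode fix this ticket asks for) is to verify a snapshot actually exists under the directory's .snapshot folder before issuing deleteSnapshot, so no DeleteSnapshotOp for a nonexistent snapshot is ever submitted. The directory and snapshot names below are taken from the report:

```shell
DIR=/user/user1
SNAP=distcp-1546382661--205240459-new

if command -v hdfs >/dev/null; then
  # Only delete the snapshot if it is listed under $DIR/.snapshot.
  if hdfs dfs -ls "$DIR/.snapshot" 2>/dev/null | grep -q "/$SNAP\$"; then
    hdfs dfs -deleteSnapshot "$DIR" "$SNAP"
  else
    echo "snapshot $SNAP not present under $DIR, skipping delete"
  fi
fi
```

This only narrows the race window on the client side; the real fix discussed here would make the create/delete snapshot paths atomic with respect to edit-log writes inside the NameNode.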



> Ensure atomic transactions when snapshot manager is facing OS resource limit 
> issues
> -----------------------------------------------------------------------------------
>
>                 Key: HDFS-17399
>                 URL: https://issues.apache.org/jira/browse/HDFS-17399
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: snapshots
>    Affects Versions: 3.1.1
>            Reporter: Srinivasu Majeti
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
