[ 
https://issues.apache.org/jira/browse/HDDS-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841330#comment-17841330
 ] 

Hemant Kumar commented on HDDS-10738:
-------------------------------------

Another thing to note: the threads serving OM snapshot (DB checkpoint) 
requests for lagging followers are stuck waiting and eventually fail when 
the OM crashes. No request succeeded after 2024-04-22 10:47:31.

{code}
[root@vc0113 hadoop-ozone]# grep "DBCheckpointServlet" ozone-om.log
2024-04-21 21:58:52,066 INFO 
[qtp1170570685-7577]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Received 
GET request to obtain DB checkpoint snapshot
2024-04-21 22:03:37,190 INFO 
[qtp1170570685-7577]-org.apache.hadoop.ozone.om.OMDBCheckpointServlet: 
Compaction pausing 1 started.
2024-04-21 22:03:38,320 INFO 
[qtp1170570685-7577]-org.apache.hadoop.ozone.om.OMDBCheckpointServlet: 
Compaction pausing 1 ended. Elapsed ms: 1130
2024-04-21 22:03:38,737 ERROR 
[qtp1170570685-7577]-org.apache.hadoop.ozone.om.OMDBCheckpointServlet: got 
exception writing to archive java.io.IOException: Close 
SendCallback@254ca09a[PROCESSING][i=HTTP/1.1{s=200,h=10,cl=-1},cb=org.eclipse.jetty.server.HttpChannel$SendCallback@628f03da]
 in state PROCESSING
2024-04-21 22:03:38,737 ERROR 
[qtp1170570685-7577]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Unable 
to process metadata snapshot request.
                at 
org.apache.hadoop.ozone.om.OMDBCheckpointServlet.writeDbDataToStream(OMDBCheckpointServlet.java:176)
                at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.generateSnapshotCheckpoint(DBCheckpointServlet.java:220)
                at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.doGet(DBCheckpointServlet.java:303)
2024-04-21 22:08:57,113 INFO 
[qtp1170570685-21590]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
2024-04-21 22:09:31,821 INFO 
[qtp1170570685-21590]-org.apache.hadoop.ozone.om.OMDBCheckpointServlet: 
Compaction pausing 2 started.
2024-04-21 22:09:33,287 INFO 
[qtp1170570685-21590]-org.apache.hadoop.ozone.om.OMDBCheckpointServlet: 
Compaction pausing 2 ended. Elapsed ms: 1466
2024-04-21 22:09:33,691 ERROR 
[qtp1170570685-21590]-org.apache.hadoop.ozone.om.OMDBCheckpointServlet: got 
exception writing to archive org.eclipse.jetty.io.EofException
2024-04-21 22:09:33,691 ERROR 
[qtp1170570685-21590]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Unable 
to process metadata snapshot request.
        at 
org.apache.hadoop.ozone.om.OMDBCheckpointServlet.writeFilesToArchive(OMDBCheckpointServlet.java:591)
        at 
org.apache.hadoop.ozone.om.OMDBCheckpointServlet.writeDbDataToStream(OMDBCheckpointServlet.java:174)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.generateSnapshotCheckpoint(DBCheckpointServlet.java:220)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.doGet(DBCheckpointServlet.java:303)
                at 
org.apache.hadoop.ozone.om.OMDBCheckpointServlet.writeDbDataToStream(OMDBCheckpointServlet.java:176)
2024-04-21 22:19:02,195 INFO 
[qtp1170570685-6287]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Received 
GET request to obtain DB checkpoint snapshot
2024-04-21 22:19:02,379 INFO 
[qtp1170570685-6287]-org.apache.hadoop.ozone.om.OMDBCheckpointServlet: 
Compaction pausing 3 started.
2024-04-21 22:19:04,519 INFO 
[qtp1170570685-6287]-org.apache.hadoop.ozone.om.OMDBCheckpointServlet: 
Compaction pausing 3 ended. Elapsed ms: 2140
2024-04-21 22:19:09,467 INFO 
[qtp1170570685-6287]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Time 
taken to write the checkpoint to response output stream: 4948 milliseconds
2024-04-21 22:19:09,467 INFO 
[qtp1170570685-6287]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Excluded 
SST [] from the latest checkpoint.
2024-04-22 03:44:10,427 INFO 
[qtp1421547398-15329]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
2024-04-22 03:54:15,483 INFO 
[qtp1421547398-15576]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
2024-04-22 04:04:20,565 INFO 
[qtp1421547398-15331]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
2024-04-22 04:14:25,645 INFO 
[qtp1421547398-16116]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
2024-04-22 10:47:31,842 INFO 
[qtp560715723-671567]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
2024-04-22 10:57:36,898 INFO 
[qtp560715723-672419]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
2024-04-22 11:07:41,992 INFO 
[qtp560715723-672879]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
2024-04-22 11:17:47,063 INFO 
[qtp560715723-331804]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
2024-04-22 11:27:52,163 INFO 
[qtp560715723-674263]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
2024-04-22 11:37:57,244 INFO 
[qtp560715723-676183]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
2024-04-22 11:48:02,328 INFO 
[qtp560715723-661725]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
2024-04-22 11:58:07,414 INFO 
[qtp560715723-671346]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
2024-04-22 12:09:33,664 INFO 
[qtp560715723-677026]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
2024-04-22 12:21:38,587 INFO 
[qtp560715723-676994]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
2024-04-22 12:46:51,334 INFO 
[qtp560715723-676269]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
2024-04-22 13:16:20,604 INFO 
[qtp560715723-678006]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: 
Received GET request to obtain DB checkpoint snapshot
2024-04-22 13:18:40,074 ERROR 
[qtp560715723-676269]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Unable 
to process metadata snapshot request.
        at 
org.apache.hadoop.ozone.om.OMDBCheckpointServlet$Lock.lock(OMDBCheckpointServlet.java:654)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.generateSnapshotCheckpoint(DBCheckpointServlet.java:197)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.doGet(DBCheckpointServlet.java:303)
2024-04-22 13:18:40,074 ERROR 
[qtp560715723-676994]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Unable 
to process metadata snapshot request.
        at 
org.apache.hadoop.ozone.om.OMDBCheckpointServlet$Lock.lock(OMDBCheckpointServlet.java:654)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.generateSnapshotCheckpoint(DBCheckpointServlet.java:197)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.doGet(DBCheckpointServlet.java:303)
2024-04-22 13:18:40,074 ERROR 
[qtp560715723-676183]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Unable 
to process metadata snapshot request.
        at 
org.apache.hadoop.ozone.om.OMDBCheckpointServlet$Lock.lock(OMDBCheckpointServlet.java:654)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.generateSnapshotCheckpoint(DBCheckpointServlet.java:197)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.doGet(DBCheckpointServlet.java:303)
2024-04-22 13:18:40,075 ERROR 
[qtp560715723-671346]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Unable 
to process metadata snapshot request.
        at 
org.apache.hadoop.ozone.om.OMDBCheckpointServlet$Lock.lock(OMDBCheckpointServlet.java:654)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.generateSnapshotCheckpoint(DBCheckpointServlet.java:197)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.doGet(DBCheckpointServlet.java:303)
2024-04-22 13:18:40,074 ERROR 
[qtp560715723-677026]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Unable 
to process metadata snapshot request.
        at 
org.apache.hadoop.ozone.om.OMDBCheckpointServlet$Lock.lock(OMDBCheckpointServlet.java:654)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.generateSnapshotCheckpoint(DBCheckpointServlet.java:197)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.doGet(DBCheckpointServlet.java:303)
2024-04-22 13:18:40,074 ERROR 
[qtp560715723-331804]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Unable 
to process metadata snapshot request.
        at 
org.apache.hadoop.ozone.om.OMDBCheckpointServlet$Lock.lock(OMDBCheckpointServlet.java:654)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.generateSnapshotCheckpoint(DBCheckpointServlet.java:197)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.doGet(DBCheckpointServlet.java:303)
2024-04-22 13:18:40,074 ERROR 
[qtp560715723-671567]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Unable 
to process metadata snapshot request.
        at 
org.apache.hadoop.ozone.om.OMDBCheckpointServlet$Lock.lock(OMDBCheckpointServlet.java:660)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.generateSnapshotCheckpoint(DBCheckpointServlet.java:197)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.doGet(DBCheckpointServlet.java:303)
2024-04-22 13:18:40,074 ERROR 
[qtp560715723-678006]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Unable 
to process metadata snapshot request.
        at 
org.apache.hadoop.ozone.om.OMDBCheckpointServlet$Lock.lock(OMDBCheckpointServlet.java:654)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.generateSnapshotCheckpoint(DBCheckpointServlet.java:197)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.doGet(DBCheckpointServlet.java:303)
2024-04-22 13:18:40,074 ERROR 
[qtp560715723-661725]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Unable 
to process metadata snapshot request.
        at 
org.apache.hadoop.ozone.om.OMDBCheckpointServlet$Lock.lock(OMDBCheckpointServlet.java:654)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.generateSnapshotCheckpoint(DBCheckpointServlet.java:197)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.doGet(DBCheckpointServlet.java:303)
2024-04-22 13:18:40,074 ERROR 
[qtp560715723-672419]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Unable 
to process metadata snapshot request.
        at 
org.apache.hadoop.ozone.om.OMDBCheckpointServlet$Lock.lock(OMDBCheckpointServlet.java:654)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.generateSnapshotCheckpoint(DBCheckpointServlet.java:197)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.doGet(DBCheckpointServlet.java:303)
2024-04-22 13:18:40,074 ERROR 
[qtp560715723-674263]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Unable 
to process metadata snapshot request.
        at 
org.apache.hadoop.ozone.om.OMDBCheckpointServlet$Lock.lock(OMDBCheckpointServlet.java:654)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.generateSnapshotCheckpoint(DBCheckpointServlet.java:197)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.doGet(DBCheckpointServlet.java:303)
2024-04-22 13:18:40,074 ERROR 
[qtp560715723-672879]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Unable 
to process metadata snapshot request.
        at 
org.apache.hadoop.ozone.om.OMDBCheckpointServlet$Lock.lock(OMDBCheckpointServlet.java:654)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.generateSnapshotCheckpoint(DBCheckpointServlet.java:197)
        at 
org.apache.hadoop.hdds.utils.DBCheckpointServlet.doGet(DBCheckpointServlet.java:303)
{code}
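
To check this against a log like the one above, each "Received GET request" 
line can be paired with the next outcome logged by the same thread. The 
following is only a rough sketch: the function names are made up for 
illustration, the sample is a short excerpt of the log above, and matching by 
thread name is a heuristic, since Jetty reuses pool threads.

```python
import re

# A log record starts with a timestamp; wrapped continuation lines do not.
TS = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ")
THREAD = re.compile(r"\[(qtp\d+-\d+)\]")

# Excerpt copied from the grep output above (still hard-wrapped).
SAMPLE = """\
2024-04-21 22:19:02,195 INFO
[qtp1170570685-6287]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Received
GET request to obtain DB checkpoint snapshot
2024-04-21 22:19:09,467 INFO
[qtp1170570685-6287]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Time
taken to write the checkpoint to response output stream: 4948 milliseconds
2024-04-22 12:46:51,334 INFO
[qtp560715723-676269]-org.apache.hadoop.hdds.utils.DBCheckpointServlet:
Received GET request to obtain DB checkpoint snapshot
2024-04-22 13:18:40,074 ERROR
[qtp560715723-676269]-org.apache.hadoop.hdds.utils.DBCheckpointServlet: Unable
to process metadata snapshot request.
"""


def records(lines):
    """Re-join hard-wrapped lines into one string per log record."""
    current = []
    for line in lines:
        text = line.strip()
        if TS.match(text) and current:
            yield " ".join(current)
            current = []
        if text:
            current.append(text)
    if current:
        yield " ".join(current)


def request_outcomes(lines):
    """Return (request timestamp, thread, outcome) for each GET request."""
    open_requests = {}  # thread name -> index into results
    results = []
    for rec in records(lines):
        match = THREAD.search(rec)
        if not match:
            continue
        thread = match.group(1)
        ts = rec[:23]  # "YYYY-MM-DD HH:MM:SS,mmm"
        if "Received GET request" in rec:
            open_requests[thread] = len(results)
            results.append([ts, thread, "no outcome logged"])
        elif thread in open_requests:
            if "Time taken to write the checkpoint" in rec:
                results[open_requests.pop(thread)][2] = "succeeded at " + ts
            elif "Unable to process metadata snapshot request" in rec:
                results[open_requests.pop(thread)][2] = "failed at " + ts
    return [tuple(r) for r in results]


if __name__ == "__main__":
    for started, thread, outcome in request_outcomes(SAMPLE.splitlines()):
        print(started, thread, outcome)
```

Running the same function over the full grep output would list every request 
with its start time, so requests that waited until 13:18:40 before failing 
stand out from the one that completed in a few seconds.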

> Unable to load snapshot exception encountered in a LR setup
> -----------------------------------------------------------
>
>                 Key: HDDS-10738
>                 URL: https://issues.apache.org/jira/browse/HDDS-10738
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: OM, Snapshot
>            Reporter: Jyotirmoy Sinha
>            Assignee: Hemant Kumar
>            Priority: Major
>              Labels: ozone-snapshot
>
> Scenario:
>  * Generate data with parallel threads across various volumes/buckets
>  * Perform parallel snapshot create/delete/list operations on the above buckets
>  * Perform parallel snapdiff operations on each bucket
>  * Perform parallel read operations on snapshot contents
>  * Introduce OM and cluster restarts in between, along with DN decommissioning 
> and balancer restarts.
> Observation: When multiple threads are running snapshot operations, the 
> snapshot path contents are still not accessible after 20 minutes.
> Snapshot creation log (OM):
> {code:java}
> 2024-04-22 11:18:49,123 INFO [OM StateMachine ApplyTransaction Thread - 
> 0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: 
> Created snapshot: 'snap1713809817' with snapshotId: 
> '849a95ab-c5bc-4b78-9d0a-fdad34fd331a' under path 'vol-dp4tz/buck-cp6e6' 
> {code}
> OM error stack trace:
> {code:java}
> 2024-04-22 11:42:04,768 INFO [IPC Server handler 37 on 
> 9862]-org.apache.hadoop.hdds.utils.db.RDBCheckpointUtils: Checkpoint 
> directory: 60 didn't get created in 
> /var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a
>  secs.
> 2024-04-22 11:42:04,768 ERROR [IPC Server handler 37 on 
> 9862]-org.apache.hadoop.ozone.om.OmSnapshotManager: Failed to retrieve 
> snapshot: /vol-dp4tz/buck-cp6e6/snap1713809817
> TIMEOUT org.apache.hadoop.ozone.om.exceptions.OMException: Unable to load 
> snapshot. Snapshot checkpoint directory 
> '/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a'
>  does not exist yet. Please wait a few more seconds before retrying
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotUtils.checkSnapshotDirExist(SnapshotUtils.java:113)
>         at 
> org.apache.hadoop.ozone.om.OmMetadataManagerImpl.<init>(OmMetadataManagerImpl.java:406)
>         at 
> org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:357)
>         at 
> org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:1)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$1(SnapshotCache.java:147)
>         at 
> java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:143)
>         at 
> org.apache.hadoop.ozone.om.OmSnapshotManager.checkForSnapshot(OmSnapshotManager.java:625)
>         at 
> org.apache.hadoop.ozone.om.OzoneManager.getReader(OzoneManager.java:4634)
>         at 
> org.apache.hadoop.ozone.om.OzoneManager.getFileStatus(OzoneManager.java:3572)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.getOzoneFileStatus(OzoneManagerRequestHandler.java:1002)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:257)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:220)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:174)
>         at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:143)
>         at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
> 2024-04-22 11:42:04,770 WARN [IPC Server handler 37 on 
> 9862]-org.apache.hadoop.ipc.Server: IPC Server handler 37 on 9862, call 
> Call#2 Retry#0 
> org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 
> 10.17.207.24:57514
> java.lang.IllegalStateException: TIMEOUT 
> org.apache.hadoop.ozone.om.exceptions.OMException: Unable to load snapshot. 
> Snapshot checkpoint directory 
> '/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a'
>  does not exist yet. Please wait a few more seconds before retrying {code}
> The above error occurs repeatedly for multiple snapshots, mostly during 
> parallel snapshot operations across various volumes/buckets and not during 
> serial operations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
