Jyotirmoy Sinha created HDDS-10738:
--------------------------------------

             Summary: Unable to load snapshot exception encountered in a LR 
setup
                 Key: HDDS-10738
                 URL: https://issues.apache.org/jira/browse/HDDS-10738
             Project: Apache Ozone
          Issue Type: Bug
          Components: OM, Snapshot
            Reporter: Jyotirmoy Sinha


Scenario:
 * Generate data from parallel threads across multiple volumes/buckets
 * Run parallel snapshot create/delete/list operations on those buckets
 * Run parallel snapdiff operations on each bucket
 * Read snapshot contents in parallel
 * Introduce OM and cluster restarts in between, along with DN decommissioning 
and balancer restarts
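
The parallel part of the scenario can be sketched roughly as below. This is a minimal, illustrative harness and not the actual test code: the class name {{ParallelSnapshotWorkload}} and the method {{countFailures}} are hypothetical, and each {{Callable}} stands in for one snapshot create/delete/list, snapdiff, or read operation issued through whatever client the test uses.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical harness: runs operations concurrently and counts how many failed. */
final class ParallelSnapshotWorkload {
    private ParallelSnapshotWorkload() { }

    /**
     * Runs every operation on a fixed-size thread pool and waits for all of them.
     * Each Callable is a placeholder for one snapshot operation (assumption).
     *
     * @return the number of operations that threw an exception
     */
    static int countFailures(List<Callable<Void>> operations, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // invokeAll blocks until every submitted operation has completed.
            List<Future<Void>> results = pool.invokeAll(operations);
            int failures = 0;
            for (Future<Void> result : results) {
                try {
                    result.get();
                } catch (ExecutionException e) {
                    // The operation itself threw, e.g. an "Unable to load snapshot" error.
                    failures++;
                }
            }
            return failures;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Running the same list of operations with {{threads = 1}} and then with a larger pool is one way to compare the serial and parallel behaviour.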

Observation: When multiple threads run snapshot operations, the snapshot path 
contents are still not accessible after 20 minutes.

OM snapshot creation log:
{code:java}
2024-04-22 11:18:49,123 INFO [OM StateMachine ApplyTransaction Thread - 0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created snapshot: 'snap1713809817' with snapshotId: '849a95ab-c5bc-4b78-9d0a-fdad34fd331a' under path 'vol-dp4tz/buck-cp6e6'
{code}
OM error stack trace:
{code:java}
2024-04-22 11:42:04,768 INFO [IPC Server handler 37 on 9862]-org.apache.hadoop.hdds.utils.db.RDBCheckpointUtils: Checkpoint directory: 60 didn't get created in /var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a secs.
2024-04-22 11:42:04,768 ERROR [IPC Server handler 37 on 9862]-org.apache.hadoop.ozone.om.OmSnapshotManager: Failed to retrieve snapshot: /vol-dp4tz/buck-cp6e6/snap1713809817
TIMEOUT org.apache.hadoop.ozone.om.exceptions.OMException: Unable to load snapshot. Snapshot checkpoint directory '/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a' does not exist yet. Please wait a few more seconds before retrying
        at org.apache.hadoop.ozone.om.snapshot.SnapshotUtils.checkSnapshotDirExist(SnapshotUtils.java:113)
        at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.<init>(OmMetadataManagerImpl.java:406)
        at org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:357)
        at org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:1)
        at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$1(SnapshotCache.java:147)
        at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
        at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:143)
        at org.apache.hadoop.ozone.om.OmSnapshotManager.checkForSnapshot(OmSnapshotManager.java:625)
        at org.apache.hadoop.ozone.om.OzoneManager.getReader(OzoneManager.java:4634)
        at org.apache.hadoop.ozone.om.OzoneManager.getFileStatus(OzoneManager.java:3572)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.getOzoneFileStatus(OzoneManagerRequestHandler.java:1002)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:257)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:220)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:174)
        at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:143)
        at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
2024-04-22 11:42:04,770 WARN [IPC Server handler 37 on 9862]-org.apache.hadoop.ipc.Server: IPC Server handler 37 on 9862, call Call#2 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 10.17.207.24:57514
java.lang.IllegalStateException: TIMEOUT org.apache.hadoop.ozone.om.exceptions.OMException: Unable to load snapshot. Snapshot checkpoint directory '/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a' does not exist yet. Please wait a few more seconds before retrying
{code}
The above error occurs repeatedly for multiple snapshots. It is seen mostly 
during parallel snapshot operations across various volumes/buckets, not during 
serial operations.
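
For reference, the kind of bounded wait that the log messages suggest (poll for the checkpoint directory, give up after a timeout) can be sketched as follows. This is only an illustration and not the actual {{RDBCheckpointUtils}} or {{SnapshotUtils}} code: the class {{CheckpointDirWaiter}}, the method {{waitForDirectory}}, and its parameters are hypothetical names chosen for this sketch.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

/** Hypothetical helper: polls for a directory until it exists or a timeout elapses. */
final class CheckpointDirWaiter {
    private CheckpointDirWaiter() { }

    /**
     * Checks repeatedly whether dir exists as a directory.
     *
     * @param dir          directory to wait for (e.g. a snapshot checkpoint directory)
     * @param timeout      maximum total time to wait
     * @param pollInterval pause between two existence checks
     * @return true if the directory appeared in time, false if the wait timed out
     */
    static boolean waitForDirectory(Path dir, Duration timeout, Duration pollInterval)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (Files.isDirectory(dir)) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                // Timed out; the caller decides whether to fail or retry later.
                return false;
            }
            // Sleep for the poll interval, but never past the deadline.
            long remainingMillis = Math.max(1, remainingNanos / 1_000_000);
            Thread.sleep(Math.min(pollInterval.toMillis(), remainingMillis));
        }
    }
}
```

With a wait of this shape, a {{false}} result corresponds to the "does not exist yet" condition reported above. In this issue the directory is still missing more than 20 minutes after the snapshot was created, so a longer timeout alone would probably not help.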


