[
https://issues.apache.org/jira/browse/HDDS-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841333#comment-17841333
]
Hemant Kumar commented on HDDS-10738:
-------------------------------------
The last system restart before the crash was at 2024-04-22 06:21:34.
{code}
************************************************************/
2024-04-22 03:43:16,133 INFO
[main]-org.apache.hadoop.ozone.om.OzoneManagerStarter: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting OzoneManager
STARTUP_MSG: host = vc0113.halxg.cloudera.com/10.17.207.23
STARTUP_MSG: args = []
--
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
2024-04-22 04:22:06,230 INFO
[main]-org.apache.hadoop.ozone.om.OzoneManagerStarter: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting OzoneManager
STARTUP_MSG: host = vc0113.halxg.cloudera.com/10.17.207.23
STARTUP_MSG: args = [--init]
--
************************************************************/
2024-04-22 04:22:25,198 INFO
[main]-org.apache.hadoop.ozone.om.OzoneManagerStarter: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting OzoneManager
STARTUP_MSG: host = vc0113.halxg.cloudera.com/10.17.207.23
STARTUP_MSG: args = []
--
2024-04-22 06:20:25,151 INFO
[shutdown-hook-0]-org.apache.hadoop.hdds.utils.BackgroundService: Shutting down
service SstFilteringService
2024-04-22 06:21:34,736 INFO
[main]-org.apache.hadoop.ozone.om.OzoneManagerStarter: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting OzoneManager
STARTUP_MSG: host = vc0113.halxg.cloudera.com/10.17.207.23
STARTUP_MSG: args = [--init]
--
************************************************************/
2024-04-22 06:21:53,679 INFO
[main]-org.apache.hadoop.ozone.om.OzoneManagerStarter: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting OzoneManager
STARTUP_MSG: host = vc0113.halxg.cloudera.com/10.17.207.23
STARTUP_MSG: args = []
{code}
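For anyone repeating this check on another log, here is a minimal sketch of how the startup entries could be pulled out programmatically. The class and method names (OmStartupScanner, findStartups) are hypothetical and not part of Ozone. It assumes the layout seen in the excerpt above: a timestamp line, then a "STARTUP_MSG: Starting OzoneManager" line, then a "STARTUP_MSG: args = [...]" line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Ozone: lists OM startups found in a log excerpt. */
class OmStartupScanner {
    // Leading timestamp of a log line, e.g. "2024-04-22 03:43:16,133 INFO".
    private static final Pattern TIMESTAMP =
        Pattern.compile("^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}),\\d{3}");
    // The startup banner line that carries the command-line arguments.
    private static final Pattern ARGS =
        Pattern.compile("^STARTUP_MSG:\\s+args = (\\[.*\\])\\s*$");

    /** Returns one "timestamp args=[...]" entry per OzoneManager startup, in log order. */
    static List<String> findStartups(List<String> lines) {
        List<String> startups = new ArrayList<>();
        String lastTimestamp = null; // most recent timestamp seen so far
        String pending = null;       // timestamp of a startup still waiting for its args line
        for (String raw : lines) {
            String line = raw.trim();
            Matcher ts = TIMESTAMP.matcher(line);
            if (ts.find()) {
                lastTimestamp = ts.group(1);
            }
            if (line.startsWith("STARTUP_MSG: Starting OzoneManager")) {
                pending = lastTimestamp;
            }
            Matcher args = ARGS.matcher(line);
            if (pending != null && args.find()) {
                startups.add(pending + " args=" + args.group(1));
                pending = null;
            }
        }
        return startups;
    }
}
```

Run against the excerpt above, this should list five startups, with the entries carrying args=[--init] kept distinct from the regular starts that follow them.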
> Unable to load snapshot exception encountered in a LR setup
> -----------------------------------------------------------
>
> Key: HDDS-10738
> URL: https://issues.apache.org/jira/browse/HDDS-10738
> Project: Apache Ozone
> Issue Type: Bug
> Components: OM, Snapshot
> Reporter: Jyotirmoy Sinha
> Assignee: Hemant Kumar
> Priority: Major
> Labels: ozone-snapshot
>
> Scenario:
> * Generate data from parallel threads across various volumes/buckets
> * Perform parallel snapshot create/delete/list operations on those buckets
> * Perform parallel snapdiff operations on each bucket
> * Perform parallel reads of snapshot contents
> * Introduce OM and cluster restarts in between, along with DN decommissioning
> and balancer restarts.
> Observation: when multiple threads are running snapshot operations, the
> snapshot path contents are not accessible even after 20 minutes.
> Snapshot creation log (OM):
> {code:java}
> 2024-04-22 11:18:49,123 INFO [OM StateMachine ApplyTransaction Thread -
> 0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest:
> Created snapshot: 'snap1713809817' with snapshotId:
> '849a95ab-c5bc-4b78-9d0a-fdad34fd331a' under path 'vol-dp4tz/buck-cp6e6'
> {code}
> OM error stack trace:
> {code:java}
> 2024-04-22 11:42:04,768 INFO [IPC Server handler 37 on
> 9862]-org.apache.hadoop.hdds.utils.db.RDBCheckpointUtils: Checkpoint
> directory: 60 didn't get created in
> /var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a
> secs.
> 2024-04-22 11:42:04,768 ERROR [IPC Server handler 37 on
> 9862]-org.apache.hadoop.ozone.om.OmSnapshotManager: Failed to retrieve
> snapshot: /vol-dp4tz/buck-cp6e6/snap1713809817
> TIMEOUT org.apache.hadoop.ozone.om.exceptions.OMException: Unable to load
> snapshot. Snapshot checkpoint directory
> '/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a'
> does not exist yet. Please wait a few more seconds before retrying
> at
> org.apache.hadoop.ozone.om.snapshot.SnapshotUtils.checkSnapshotDirExist(SnapshotUtils.java:113)
> at
> org.apache.hadoop.ozone.om.OmMetadataManagerImpl.<init>(OmMetadataManagerImpl.java:406)
> at
> org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:357)
> at
> org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:1)
> at
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$1(SnapshotCache.java:147)
> at
> java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
> at
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:143)
> at
> org.apache.hadoop.ozone.om.OmSnapshotManager.checkForSnapshot(OmSnapshotManager.java:625)
> at
> org.apache.hadoop.ozone.om.OzoneManager.getReader(OzoneManager.java:4634)
> at
> org.apache.hadoop.ozone.om.OzoneManager.getFileStatus(OzoneManager.java:3572)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.getOzoneFileStatus(OzoneManagerRequestHandler.java:1002)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:257)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:220)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:174)
> at
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:143)
> at
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
> 2024-04-22 11:42:04,770 WARN [IPC Server handler 37 on
> 9862]-org.apache.hadoop.ipc.Server: IPC Server handler 37 on 9862, call
> Call#2 Retry#0
> org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from
> 10.17.207.24:57514
> java.lang.IllegalStateException: TIMEOUT
> org.apache.hadoop.ozone.om.exceptions.OMException: Unable to load snapshot.
> Snapshot checkpoint directory
> '/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a'
> does not exist yet. Please wait a few more seconds before retrying {code}
> The above error occurs repeatedly for multiple snapshots, mostly during
> parallel snapshot operations across various volumes/buckets and not during
> serial operations.
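The stack trace above ends in a TIMEOUT while waiting for the snapshot checkpoint directory to appear, and the log message suggests a wait of about 60 seconds. As a rough illustration of that kind of bounded wait, here is a minimal sketch. It is not the Ozone implementation; the class name, method name, and parameters are all hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

/** Hypothetical illustration of a bounded wait for a directory; not Ozone code. */
class CheckpointDirWaiter {
    /**
     * Polls until {@code dir} exists as a directory or {@code timeout} elapses.
     * Returns true if the directory appeared in time, false on timeout or interrupt.
     */
    static boolean waitForDirectory(Path dir, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (Files.isDirectory(dir)) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false; // caller decides whether to surface this as a TIMEOUT error
            }
            // Sleep for the poll interval, but never past the deadline.
            long sleepMillis = Math.max(1, Math.min(pollInterval.toMillis(), remainingNanos / 1_000_000));
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

With a wait of this shape, a read that arrives before the checkpoint directory exists blocks its handler thread for up to the full timeout and then fails. That would be consistent with the error showing up mostly under parallel load, though the sketch does not establish the root cause.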