[
https://issues.apache.org/jira/browse/HDDS-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841324#comment-17841324
]
Hemant Kumar commented on HDDS-10738:
-------------------------------------
I looked into logs on the OM logs on which we saw the issue. I found the double
buffer flush thread was lagging.
Few create snapshot requests log just before and after double buffer didn't
happen :
{code}
2024-04-22 10:45:32,176 INFO [OM StateMachine ApplyTransaction Thread -
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created
snapshot: 'snap192' with snapshotId: '278ac149-0c16-43fa-9518-1a1a65d13f8c'
under path 'voltest21713793040/buck1'
2024-04-22 10:45:32,975 INFO [OM StateMachine ApplyTransaction Thread -
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created
snapshot: 'snap407' with snapshotId: '536377a2-9400-4606-b6bc-5723e93c2a84'
under path 'voltest21713768541/buck1'
2024-04-22 10:45:33,276 INFO [OM StateMachine ApplyTransaction Thread -
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created
snapshot: 'snap258' with snapshotId: '97c1e436-cc27-4512-a157-28e93219889c'
under path 'voltest21713782436/buck1'
2024-04-22 10:45:42,447 INFO [OM StateMachine ApplyTransaction Thread -
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created
snapshot: 'snap192' with snapshotId: 'd22c5a57-f509-41cc-8b97-0317d86153a7'
under path 'voltest21713793041/buck1'
2024-04-22 10:45:42,449 INFO [OM StateMachine ApplyTransaction Thread -
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created
snapshot: 'snap192' with snapshotId: 'f9b9033b-81ef-4e35-ae35-b347bd55e505'
under path 'voltest21713793039/buck1'
2024-04-22 10:45:46,085 INFO [OM StateMachine ApplyTransaction Thread -
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created
snapshot: 'snap258' with snapshotId: '305aa7ce-0c5f-4ed8-98f5-d9c84dc79c01'
under path 'voltest21713782425/buck1'
2024-04-22 10:45:46,543 INFO [OM StateMachine ApplyTransaction Thread -
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created
snapshot: 'snap259' with snapshotId: 'cce1ea43-c703-46c9-a3f3-edf882fcf38e'
under path 'voltest21713782432/buck1'
2024-04-22 10:45:48,214 INFO [OM StateMachine ApplyTransaction Thread -
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created
snapshot: 'snap1552' with snapshotId: '60b6d6d2-186f-472a-a76d-797aa0bc6cf7'
under path 'voltest21713338338/buck1'
2024-04-22 10:45:53,211 INFO [OM StateMachine ApplyTransaction Thread -
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created
snapshot: 'snap1548' with snapshotId: 'a1d3c828-baa9-4421-aa96-e68cf6b2c224'
under path 'voltest21713338336/buck1'
2024-04-22 10:46:06,377 INFO [OM StateMachine ApplyTransaction Thread -
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created
snapshot: 'snap1552' with snapshotId: '875beaeb-745f-41e2-8673-28eb521afbe1'
under path 'voltest21713338330/buck1'
2024-04-22 10:46:08,014 INFO [OM StateMachine ApplyTransaction Thread -
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created
snapshot: 'snap259' with snapshotId: '1b941cc4-412e-4073-8bba-eeb60a6842ba'
under path 'voltest21713782428/buck1'
{code}
Corresponding double buffer flush:
{code}
2024-04-22 13:07:11,011 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-278ac149-0c16-43fa-9518-1a1a65d13f8c
in 1134 milliseconds
2024-04-22 13:08:12,380 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-536377a2-9400-4606-b6bc-5723e93c2a84
in 1298 milliseconds
2024-04-22 13:10:12,459 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-97c1e436-cc27-4512-a157-28e93219889c
in 1286 milliseconds
2024-04-22 13:11:13,571 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-d22c5a57-f509-41cc-8b97-0317d86153a7
in 1011 milliseconds
2024-04-22 13:12:15,040 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-f9b9033b-81ef-4e35-ae35-b347bd55e505
in 1367 milliseconds
2024-04-22 13:13:16,626 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-305aa7ce-0c5f-4ed8-98f5-d9c84dc79c01
in 1518 milliseconds
2024-04-22 13:14:18,002 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-cce1ea43-c703-46c9-a3f3-edf882fcf38e
in 1265 milliseconds
2024-04-22 13:15:19,206 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-60b6d6d2-186f-472a-a76d-797aa0bc6cf7
in 1123 milliseconds
2024-04-22 13:16:20,533 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-a1d3c828-baa9-4421-aa96-e68cf6b2c224
in 1227 milliseconds
{code}
Snapshot: 'snap1548' with snapshotId: 'a1d3c828-baa9-4421-aa96-e68cf6b2c224'
under path 'voltest21713338336/buck1' was created on 2024-04-22 10:45:53 and
double buffer was flushed for this snapshot on 2024-04-22 13:16:20, after more
than 2 hours.
No double buffer happened for after this snapshot.
The last few double buffer flushes for snapshot before the OM node crashed:
{code}
...
[root@vc0113 hadoop-ozone]# grep "2024-04-22 13" ozone-om.log | grep "Created
checkpoint in rocksDB at"
2024-04-22 13:00:03,295 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-b30c613d-0548-4535-ba47-7450b27a2ec9
in 972 milliseconds
2024-04-22 13:01:04,346 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-94bdeb84-7bb6-4840-a417-5bd4e2818d4e
in 964 milliseconds
2024-04-22 13:02:05,366 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-51d345d4-06f1-439c-a9ff-b6439093f757
in 907 milliseconds
2024-04-22 13:03:06,552 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-dc882064-7bb8-4809-a22d-d83a7c129a12
in 1087 milliseconds
2024-04-22 13:04:07,662 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-777efb72-19f9-4a26-9a64-287ff2a73cf7
in 1030 milliseconds
2024-04-22 13:05:08,838 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-0975d08e-cb0f-428d-a6b7-ed7e22fd5845
in 1097 milliseconds
2024-04-22 13:06:09,759 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-039da5a8-2ac7-4c80-8858-31441d0e29b6
in 867 milliseconds
2024-04-22 13:07:11,011 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-278ac149-0c16-43fa-9518-1a1a65d13f8c
in 1134 milliseconds
2024-04-22 13:08:12,380 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-536377a2-9400-4606-b6bc-5723e93c2a84
in 1298 milliseconds
2024-04-22 13:10:12,459 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-97c1e436-cc27-4512-a157-28e93219889c
in 1286 milliseconds
2024-04-22 13:11:13,571 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-d22c5a57-f509-41cc-8b97-0317d86153a7
in 1011 milliseconds
2024-04-22 13:12:15,040 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-f9b9033b-81ef-4e35-ae35-b347bd55e505
in 1367 milliseconds
2024-04-22 13:13:16,626 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-305aa7ce-0c5f-4ed8-98f5-d9c84dc79c01
in 1518 milliseconds
2024-04-22 13:14:18,002 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-cce1ea43-c703-46c9-a3f3-edf882fcf38e
in 1265 milliseconds
2024-04-22 13:15:19,206 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-60b6d6d2-186f-472a-a76d-797aa0bc6cf7
in 1123 milliseconds
2024-04-22 13:16:20,533 INFO
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
Created checkpoint in rocksDB at
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-a1d3c828-baa9-4421-aa96-e68cf6b2c224
in 1227 milliseconds
...
{code}
> Unable to load snapshot exception encountered in a LR setup
> -----------------------------------------------------------
>
> Key: HDDS-10738
> URL: https://issues.apache.org/jira/browse/HDDS-10738
> Project: Apache Ozone
> Issue Type: Bug
> Components: OM, Snapshot
> Reporter: Jyotirmoy Sinha
> Assignee: Hemant Kumar
> Priority: Major
> Labels: ozone-snapshot
>
> Scenario :
> * Generate data over parallel threads over various volume/buckets
> * Perform parallel snapshot create/delete/list operations over above buckets
> * Perform parallel snapdiff operations over each bucket
> * Perform parallel read operations of snapshot contents
> * Introduce OM and cluster restarts in between along with DN decommissioning
> and balancer restarts.
> Observation - When multiple threads are running with snapshot operations the
> snapshot path contents are not accessible even after 20 mins
> Snapshot creation log OM -
> {code:java}
> 2024-04-22 11:18:49,123 INFO [OM StateMachine ApplyTransaction Thread -
> 0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest:
> Created snapshot: 'snap1713809817' with snapshotId:
> '849a95ab-c5bc-4b78-9d0a-fdad34fd331a' under path 'vol-dp4tz/buck-cp6e6'
> {code}
> OM Error stacktrace -
> {code:java}
> 2024-04-22 11:42:04,768 INFO [IPC Server handler 37 on
> 9862]-org.apache.hadoop.hdds.utils.db.RDBCheckpointUtils: Checkpoint
> directory: 60 didn't get created in
> /var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a
> secs.
> 2024-04-22 11:42:04,768 ERROR [IPC Server handler 37 on
> 9862]-org.apache.hadoop.ozone.om.OmSnapshotManager: Failed to retrieve
> snapshot: /vol-dp4tz/buck-cp6e6/snap1713809817
> TIMEOUT org.apache.hadoop.ozone.om.exceptions.OMException: Unable to load
> snapshot. Snapshot checkpoint directory
> '/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a'
> does not exist yet. Please wait a few more seconds before retrying
> at
> org.apache.hadoop.ozone.om.snapshot.SnapshotUtils.checkSnapshotDirExist(SnapshotUtils.java:113)
> at
> org.apache.hadoop.ozone.om.OmMetadataManagerImpl.<init>(OmMetadataManagerImpl.java:406)
> at
> org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:357)
> at
> org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:1)
> at
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$1(SnapshotCache.java:147)
> at
> java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
> at
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:143)
> at
> org.apache.hadoop.ozone.om.OmSnapshotManager.checkForSnapshot(OmSnapshotManager.java:625)
> at
> org.apache.hadoop.ozone.om.OzoneManager.getReader(OzoneManager.java:4634)
> at
> org.apache.hadoop.ozone.om.OzoneManager.getFileStatus(OzoneManager.java:3572)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.getOzoneFileStatus(OzoneManagerRequestHandler.java:1002)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:257)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:220)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:174)
> at
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:143)
> at
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
> 2024-04-22 11:42:04,770 WARN [IPC Server handler 37 on
> 9862]-org.apache.hadoop.ipc.Server: IPC Server handler 37 on 9862, call
> Call#2 Retry#0
> org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from
> 10.17.207.24:57514
> java.lang.IllegalStateException: TIMEOUT
> org.apache.hadoop.ozone.om.exceptions.OMException: Unable to load snapshot.
> Snapshot checkpoint directory
> '/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a'
> does not exist yet. Please wait a few more seconds before retrying {code}
> The above error is coming for multiple snapshots repeatedly and mostly coming
> in parallel snapshot operations across various volume/buckets, not in serial
> operations.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]