[ 
https://issues.apache.org/jira/browse/HDDS-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841324#comment-17841324
 ] 

Hemant Kumar edited comment on HDDS-10738 at 4/26/24 5:51 PM:
--------------------------------------------------------------

I looked into logs on the OM logs on which we saw the issue. I found the double 
buffer flush thread was lagging.

Few create snapshot requests log just before and after double buffer didn't 
happen :
{code}
...
2024-04-22 10:45:32,176 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap192' with snapshotId: '278ac149-0c16-43fa-9518-1a1a65d13f8c' 
under path 'voltest21713793040/buck1'
2024-04-22 10:45:32,975 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap407' with snapshotId: '536377a2-9400-4606-b6bc-5723e93c2a84' 
under path 'voltest21713768541/buck1'
2024-04-22 10:45:33,276 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap258' with snapshotId: '97c1e436-cc27-4512-a157-28e93219889c' 
under path 'voltest21713782436/buck1'
2024-04-22 10:45:42,447 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap192' with snapshotId: 'd22c5a57-f509-41cc-8b97-0317d86153a7' 
under path 'voltest21713793041/buck1'
2024-04-22 10:45:42,449 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap192' with snapshotId: 'f9b9033b-81ef-4e35-ae35-b347bd55e505' 
under path 'voltest21713793039/buck1'
2024-04-22 10:45:46,085 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap258' with snapshotId: '305aa7ce-0c5f-4ed8-98f5-d9c84dc79c01' 
under path 'voltest21713782425/buck1'
2024-04-22 10:45:46,543 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap259' with snapshotId: 'cce1ea43-c703-46c9-a3f3-edf882fcf38e' 
under path 'voltest21713782432/buck1'
2024-04-22 10:45:48,214 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap1552' with snapshotId: '60b6d6d2-186f-472a-a76d-797aa0bc6cf7' 
under path 'voltest21713338338/buck1'
2024-04-22 10:45:53,211 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap1548' with snapshotId: 'a1d3c828-baa9-4421-aa96-e68cf6b2c224' 
under path 'voltest21713338336/buck1'
2024-04-22 10:46:06,377 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap1552' with snapshotId: '875beaeb-745f-41e2-8673-28eb521afbe1' 
under path 'voltest21713338330/buck1'
2024-04-22 10:46:08,014 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap259' with snapshotId: '1b941cc4-412e-4073-8bba-eeb60a6842ba' 
under path 'voltest21713782428/buck1'
...
{code}

Corresponding double buffer flush:
{code}
...
2024-04-22 13:07:11,011 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-278ac149-0c16-43fa-9518-1a1a65d13f8c
 in 1134 milliseconds
2024-04-22 13:08:12,380 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-536377a2-9400-4606-b6bc-5723e93c2a84
 in 1298 milliseconds
2024-04-22 13:10:12,459 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-97c1e436-cc27-4512-a157-28e93219889c
 in 1286 milliseconds
2024-04-22 13:11:13,571 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-d22c5a57-f509-41cc-8b97-0317d86153a7
 in 1011 milliseconds
2024-04-22 13:12:15,040 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-f9b9033b-81ef-4e35-ae35-b347bd55e505
 in 1367 milliseconds
2024-04-22 13:13:16,626 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-305aa7ce-0c5f-4ed8-98f5-d9c84dc79c01
 in 1518 milliseconds
2024-04-22 13:14:18,002 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-cce1ea43-c703-46c9-a3f3-edf882fcf38e
 in 1265 milliseconds
2024-04-22 13:15:19,206 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-60b6d6d2-186f-472a-a76d-797aa0bc6cf7
 in 1123 milliseconds
2024-04-22 13:16:20,533 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-a1d3c828-baa9-4421-aa96-e68cf6b2c224
 in 1227 milliseconds
...
{code}

Snapshot: 'snap1548' with snapshotId: 'a1d3c828-baa9-4421-aa96-e68cf6b2c224' 
under path 'voltest21713338336/buck1' was created on 2024-04-22 10:45:53 and 
double buffer was flushed for this snapshot on  2024-04-22 13:16:20, after more 
than 2 hours. 

No double buffer happened for after this snapshot.

The last few double buffer flushes for snapshot before the OM node crashed:
{code}
[root@vc0113 hadoop-ozone]# grep "2024-04-22 13" ozone-om.log | grep "Created 
checkpoint in rocksDB at"
2024-04-22 13:00:03,295 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-b30c613d-0548-4535-ba47-7450b27a2ec9
 in 972 milliseconds
2024-04-22 13:01:04,346 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-94bdeb84-7bb6-4840-a417-5bd4e2818d4e
 in 964 milliseconds
2024-04-22 13:02:05,366 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-51d345d4-06f1-439c-a9ff-b6439093f757
 in 907 milliseconds
2024-04-22 13:03:06,552 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-dc882064-7bb8-4809-a22d-d83a7c129a12
 in 1087 milliseconds
2024-04-22 13:04:07,662 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-777efb72-19f9-4a26-9a64-287ff2a73cf7
 in 1030 milliseconds
2024-04-22 13:05:08,838 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-0975d08e-cb0f-428d-a6b7-ed7e22fd5845
 in 1097 milliseconds
2024-04-22 13:06:09,759 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-039da5a8-2ac7-4c80-8858-31441d0e29b6
 in 867 milliseconds
2024-04-22 13:07:11,011 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-278ac149-0c16-43fa-9518-1a1a65d13f8c
 in 1134 milliseconds
2024-04-22 13:08:12,380 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-536377a2-9400-4606-b6bc-5723e93c2a84
 in 1298 milliseconds
2024-04-22 13:10:12,459 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-97c1e436-cc27-4512-a157-28e93219889c
 in 1286 milliseconds
2024-04-22 13:11:13,571 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-d22c5a57-f509-41cc-8b97-0317d86153a7
 in 1011 milliseconds
2024-04-22 13:12:15,040 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-f9b9033b-81ef-4e35-ae35-b347bd55e505
 in 1367 milliseconds
2024-04-22 13:13:16,626 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-305aa7ce-0c5f-4ed8-98f5-d9c84dc79c01
 in 1518 milliseconds
2024-04-22 13:14:18,002 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-cce1ea43-c703-46c9-a3f3-edf882fcf38e
 in 1265 milliseconds
2024-04-22 13:15:19,206 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-60b6d6d2-186f-472a-a76d-797aa0bc6cf7
 in 1123 milliseconds
2024-04-22 13:16:20,533 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-a1d3c828-baa9-4421-aa96-e68cf6b2c224
 in 1227 milliseconds
{code}


was (Author: JIRAUSER297350):
I looked into logs on the OM logs on which we saw the issue. I found the double 
buffer flush thread was lagging.

Few create snapshot requests log just before and after double buffer didn't 
happen :
{code}
2024-04-22 10:45:32,176 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap192' with snapshotId: '278ac149-0c16-43fa-9518-1a1a65d13f8c' 
under path 'voltest21713793040/buck1'
2024-04-22 10:45:32,975 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap407' with snapshotId: '536377a2-9400-4606-b6bc-5723e93c2a84' 
under path 'voltest21713768541/buck1'
2024-04-22 10:45:33,276 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap258' with snapshotId: '97c1e436-cc27-4512-a157-28e93219889c' 
under path 'voltest21713782436/buck1'
2024-04-22 10:45:42,447 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap192' with snapshotId: 'd22c5a57-f509-41cc-8b97-0317d86153a7' 
under path 'voltest21713793041/buck1'
2024-04-22 10:45:42,449 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap192' with snapshotId: 'f9b9033b-81ef-4e35-ae35-b347bd55e505' 
under path 'voltest21713793039/buck1'
2024-04-22 10:45:46,085 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap258' with snapshotId: '305aa7ce-0c5f-4ed8-98f5-d9c84dc79c01' 
under path 'voltest21713782425/buck1'
2024-04-22 10:45:46,543 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap259' with snapshotId: 'cce1ea43-c703-46c9-a3f3-edf882fcf38e' 
under path 'voltest21713782432/buck1'
2024-04-22 10:45:48,214 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap1552' with snapshotId: '60b6d6d2-186f-472a-a76d-797aa0bc6cf7' 
under path 'voltest21713338338/buck1'
2024-04-22 10:45:53,211 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap1548' with snapshotId: 'a1d3c828-baa9-4421-aa96-e68cf6b2c224' 
under path 'voltest21713338336/buck1'
2024-04-22 10:46:06,377 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap1552' with snapshotId: '875beaeb-745f-41e2-8673-28eb521afbe1' 
under path 'voltest21713338330/buck1'
2024-04-22 10:46:08,014 INFO [OM StateMachine ApplyTransaction Thread - 
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created 
snapshot: 'snap259' with snapshotId: '1b941cc4-412e-4073-8bba-eeb60a6842ba' 
under path 'voltest21713782428/buck1'
{code}

Corresponding double buffer flush:
{code}
2024-04-22 13:07:11,011 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-278ac149-0c16-43fa-9518-1a1a65d13f8c
 in 1134 milliseconds
2024-04-22 13:08:12,380 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-536377a2-9400-4606-b6bc-5723e93c2a84
 in 1298 milliseconds
2024-04-22 13:10:12,459 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-97c1e436-cc27-4512-a157-28e93219889c
 in 1286 milliseconds
2024-04-22 13:11:13,571 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-d22c5a57-f509-41cc-8b97-0317d86153a7
 in 1011 milliseconds
2024-04-22 13:12:15,040 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-f9b9033b-81ef-4e35-ae35-b347bd55e505
 in 1367 milliseconds
2024-04-22 13:13:16,626 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-305aa7ce-0c5f-4ed8-98f5-d9c84dc79c01
 in 1518 milliseconds
2024-04-22 13:14:18,002 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-cce1ea43-c703-46c9-a3f3-edf882fcf38e
 in 1265 milliseconds
2024-04-22 13:15:19,206 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-60b6d6d2-186f-472a-a76d-797aa0bc6cf7
 in 1123 milliseconds
2024-04-22 13:16:20,533 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-a1d3c828-baa9-4421-aa96-e68cf6b2c224
 in 1227 milliseconds
{code}

Snapshot: 'snap1548' with snapshotId: 'a1d3c828-baa9-4421-aa96-e68cf6b2c224' 
under path 'voltest21713338336/buck1' was created on 2024-04-22 10:45:53 and 
double buffer was flushed for this snapshot on  2024-04-22 13:16:20, after more 
than 2 hours. 

No double buffer happened for after this snapshot.

The last few double buffer flushes for snapshot before the OM node crashed:
{code}
...
[root@vc0113 hadoop-ozone]# grep "2024-04-22 13" ozone-om.log | grep "Created 
checkpoint in rocksDB at"
2024-04-22 13:00:03,295 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-b30c613d-0548-4535-ba47-7450b27a2ec9
 in 972 milliseconds
2024-04-22 13:01:04,346 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-94bdeb84-7bb6-4840-a417-5bd4e2818d4e
 in 964 milliseconds
2024-04-22 13:02:05,366 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-51d345d4-06f1-439c-a9ff-b6439093f757
 in 907 milliseconds
2024-04-22 13:03:06,552 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-dc882064-7bb8-4809-a22d-d83a7c129a12
 in 1087 milliseconds
2024-04-22 13:04:07,662 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-777efb72-19f9-4a26-9a64-287ff2a73cf7
 in 1030 milliseconds
2024-04-22 13:05:08,838 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-0975d08e-cb0f-428d-a6b7-ed7e22fd5845
 in 1097 milliseconds
2024-04-22 13:06:09,759 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-039da5a8-2ac7-4c80-8858-31441d0e29b6
 in 867 milliseconds
2024-04-22 13:07:11,011 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-278ac149-0c16-43fa-9518-1a1a65d13f8c
 in 1134 milliseconds
2024-04-22 13:08:12,380 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-536377a2-9400-4606-b6bc-5723e93c2a84
 in 1298 milliseconds
2024-04-22 13:10:12,459 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-97c1e436-cc27-4512-a157-28e93219889c
 in 1286 milliseconds
2024-04-22 13:11:13,571 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-d22c5a57-f509-41cc-8b97-0317d86153a7
 in 1011 milliseconds
2024-04-22 13:12:15,040 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-f9b9033b-81ef-4e35-ae35-b347bd55e505
 in 1367 milliseconds
2024-04-22 13:13:16,626 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-305aa7ce-0c5f-4ed8-98f5-d9c84dc79c01
 in 1518 milliseconds
2024-04-22 13:14:18,002 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-cce1ea43-c703-46c9-a3f3-edf882fcf38e
 in 1265 milliseconds
2024-04-22 13:15:19,206 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-60b6d6d2-186f-472a-a76d-797aa0bc6cf7
 in 1123 milliseconds
2024-04-22 13:16:20,533 INFO 
[OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
 Created checkpoint in rocksDB at 
/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-a1d3c828-baa9-4421-aa96-e68cf6b2c224
 in 1227 milliseconds
...
{code}

> Unable to load snapshot exception encountered in a LR setup
> -----------------------------------------------------------
>
>                 Key: HDDS-10738
>                 URL: https://issues.apache.org/jira/browse/HDDS-10738
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: OM, Snapshot
>            Reporter: Jyotirmoy Sinha
>            Assignee: Hemant Kumar
>            Priority: Major
>              Labels: ozone-snapshot
>
> Scenario :
>  * Generate data over parallel threads over various volume/buckets
>  * Perform parallel snapshot create/delete/list operations over above buckets
>  * Perform parallel snapdiff operations over each bucket
>  * Perform parallel read operations of snapshot contents
>  * Introduce OM and cluster restarts in between along with DN decommissioning 
> and balancer restarts.
> Observation - When multiple threads are running with snapshot operations the 
> snapshot path contents are not accessible even after 20 mins
> Snapshot creation log OM -
> {code:java}
> 2024-04-22 11:18:49,123 INFO [OM StateMachine ApplyTransaction Thread - 
> 0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: 
> Created snapshot: 'snap1713809817' with snapshotId: 
> '849a95ab-c5bc-4b78-9d0a-fdad34fd331a' under path 'vol-dp4tz/buck-cp6e6' 
> {code}
> OM Error stacktrace - 
> {code:java}
> 2024-04-22 11:42:04,768 INFO [IPC Server handler 37 on 
> 9862]-org.apache.hadoop.hdds.utils.db.RDBCheckpointUtils: Checkpoint 
> directory: 60 didn't get created in 
> /var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a
>  secs.
> 2024-04-22 11:42:04,768 ERROR [IPC Server handler 37 on 
> 9862]-org.apache.hadoop.ozone.om.OmSnapshotManager: Failed to retrieve 
> snapshot: /vol-dp4tz/buck-cp6e6/snap1713809817
> TIMEOUT org.apache.hadoop.ozone.om.exceptions.OMException: Unable to load 
> snapshot. Snapshot checkpoint directory 
> '/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a'
>  does not exist yet. Please wait a few more seconds before retrying
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotUtils.checkSnapshotDirExist(SnapshotUtils.java:113)
>         at 
> org.apache.hadoop.ozone.om.OmMetadataManagerImpl.<init>(OmMetadataManagerImpl.java:406)
>         at 
> org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:357)
>         at 
> org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:1)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$1(SnapshotCache.java:147)
>         at 
> java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:143)
>         at 
> org.apache.hadoop.ozone.om.OmSnapshotManager.checkForSnapshot(OmSnapshotManager.java:625)
>         at 
> org.apache.hadoop.ozone.om.OzoneManager.getReader(OzoneManager.java:4634)
>         at 
> org.apache.hadoop.ozone.om.OzoneManager.getFileStatus(OzoneManager.java:3572)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.getOzoneFileStatus(OzoneManagerRequestHandler.java:1002)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:257)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:220)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:174)
>         at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:143)
>         at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
> 2024-04-22 11:42:04,770 WARN [IPC Server handler 37 on 
> 9862]-org.apache.hadoop.ipc.Server: IPC Server handler 37 on 9862, call 
> Call#2 Retry#0 
> org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 
> 10.17.207.24:57514
> java.lang.IllegalStateException: TIMEOUT 
> org.apache.hadoop.ozone.om.exceptions.OMException: Unable to load snapshot. 
> Snapshot checkpoint directory 
> '/var/lib/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-849a95ab-c5bc-4b78-9d0a-fdad34fd331a'
>  does not exist yet. Please wait a few more seconds before retrying {code}
> The above error is coming for multiple snapshots repeatedly and mostly coming 
> in parallel snapshot operations across various volume/buckets, not in serial 
> operations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to