[
https://issues.apache.org/jira/browse/HDDS-10275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17816358#comment-17816358
]
Hemant Kumar commented on HDDS-10275:
-------------------------------------
In the recent occurrence of this issue, there was snapshot delete requests.
Which eventually needs to access SnapshotCache to move keys around inside
{*}OMSnapshotMoveDeletedKeysResponse{*}. It is possible the
*OmDoubleBufferThread* is waiting inside *OMSnapshotMoveDeletedKeysResponse* to
access {*}SnapshotCache{*}. It is visible from SnapshotDeletingService logs
that, it paused around the same time, double buffer flushed stopped,
{code:java}
2024-02-01 20:12:33,642 INFO
[SnapshotDeletingService#0]-org.apache.hadoop.ozone.om.snapshot.SnapshotCache:
Loading snapshot. Table key:
/s3v/skodemo2024-4/s3v-skodemo2024-4-3c27a7c1-e158-4190-93fa-b738b9b4ea67
2024-02-01 20:12:33,649 INFO
[SnapshotDeletingService#0]-org.apache.hadoop.ozone.om.snapshot.SnapshotCache:
Loading snapshot. Table key:
/s3v/skodemo2024-4/s3v-skodemo2024-4-3cb3b199-b0e8-4adc-b53f-a58e6cb13b88
2024-02-01 20:13:33,642 WARN
[SnapshotDeletingService#0]-org.apache.hadoop.ozone.om.OmSnapshotManager:
Snapshot cache size (11) exceeds configured soft-limit (10).
2024-02-01 20:13:33,642 INFO
[SnapshotDeletingService#0]-org.apache.hadoop.ozone.om.snapshot.SnapshotCache:
Loading snapshot. Table key:
/s3v/skodemo2024-4/s3v-skodemo2024-4-5568d616-9a3a-4e4b-a331-bc63efa30f49
2024-02-01 20:13:33,647 INFO
[SnapshotDeletingService#0]-org.apache.hadoop.ozone.om.snapshot.SnapshotCache:
Loading snapshot. Table key:
/s3v/skodemo2024-4/s3v-skodemo2024-4-6007b39c-c22a-425b-993c-8e0d51d07d0a
2024-02-01 20:14:33,642 INFO
[SnapshotDeletingService#0]-org.apache.hadoop.ozone.om.snapshot.SnapshotCache:
Loading snapshot. Table key:
/s3v/skodemo2024-4/s3v-skodemo2024-4-09a4ee79-6b86-4e17-acb3-f4ffe87fe82b
2024-02-01 20:14:33,643 INFO
[SnapshotDeletingService#0]-org.apache.hadoop.ozone.om.snapshot.SnapshotCache:
Loading snapshot. Table key:
/s3v/skodemo2024-4/s3v-skodemo2024-4-6855bcbf-83d5-4170-a853-a714f1a05bcf
2024-02-01 20:15:33,642 WARN
[SnapshotDeletingService#0]-org.apache.hadoop.ozone.om.OmSnapshotManager:
Snapshot cache size (11) exceeds configured soft-limit (10).
2024-02-01 20:15:33,642 INFO
[SnapshotDeletingService#0]-org.apache.hadoop.ozone.om.snapshot.SnapshotCache:
Loading snapshot. Table key:
/s3v/skodemo2024-4/s3v-skodemo2024-4-0988bc35-880a-4824-a283-077781bb4513
2024-02-01 23:43:37,160 INFO
[shutdown-hook-0]-org.apache.hadoop.hdds.utils.BackgroundService: Shutting down
service SnapshotDeletingService
2024-02-01 23:51:03,895 INFO
[shutdown-hook-0]-org.apache.hadoop.hdds.utils.BackgroundService: Shutting down
service SnapshotDeletingService
2024-02-02 01:03:30,000 INFO
[shutdown-hook-0]-org.apache.hadoop.hdds.utils.BackgroundService: Shutting down
service SnapshotDeletingService
{code}
> Double buffer not flushing DB transactions
> ------------------------------------------
>
> Key: HDDS-10275
> URL: https://issues.apache.org/jira/browse/HDDS-10275
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Hemant Kumar
> Priority: Major
>
> While looking into snapshot diff failure because it could not load the
> snapshot because checkpointing dir doesn’t exist. Snapshot creation succeeded
> but checkpointing dir doesn’t exist because it happens inside double buffed
> flush.
> Looked at logs and there was no double buffer flush logs during that time.
> Snapshot creation request:
> {code:java}
> 2023-11-27 00:40:23,345 INFO [OM StateMachine ApplyTransaction Thread -
> 0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest:
> Created snapshot: 'snap-ay36z' with snapshotId:
> 'bf0c6141-4185-4361-b15f-c4aa71c5c6d8' under path 'vol-2xd36/buck-id806'
> {code}
> Double Buffer flush logs:
> {code:java}
> ...
> 2023-11-27 00:10:23,826 INFO
> [OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
> Created checkpoint in rocksDB at
> /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-b2e9acb3-fee2-4190-8272-0649edca8d93
> in 30 milliseconds
> 2023-11-27 00:10:23,827 INFO
> [OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointUtils:
> Waited for 1 milliseconds for checkpoint directory
> /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-b2e9acb3-fee2-4190-8272-0649edca8d93
> availability.
> 2023-11-27 00:10:23,828 INFO
> [OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.OmSnapshotManager:
> Created checkpoint :
> /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-b2e9acb3-fee2-4190-8272-0649edca8d93
> for snapshot snap-mswq9
> 2023-11-27 00:10:39,586 INFO
> [OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
> Created checkpoint in rocksDB at
> /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-3369ac3a-61e1-4eca-b3cf-eb2de0b2d688
> in 30 milliseconds
> 2023-11-27 00:10:39,586 INFO
> [OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointUtils:
> Waited for 0 milliseconds for checkpoint directory
> /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-3369ac3a-61e1-4eca-b3cf-eb2de0b2d688
> availability.
> 2023-11-27 00:10:39,587 INFO
> [OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.OmSnapshotManager:
> Created checkpoint :
> /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-3369ac3a-61e1-4eca-b3cf-eb2de0b2d688
> for snapshot snap-f5u3t
> 2023-11-27 00:10:55,949 INFO
> [OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
> Created checkpoint in rocksDB at
> /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-3a690c8f-f3ef-415d-b25c-3aaf763c9507
> in 22 milliseconds
> 2023-11-27 00:10:55,950 INFO
> [OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointUtils:
> Waited for 1 milliseconds for checkpoint directory
> /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-3a690c8f-f3ef-415d-b25c-3aaf763c9507
> availability.
> 2023-11-27 00:10:55,950 INFO
> [OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.OmSnapshotManager:
> Created checkpoint :
> /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-3a690c8f-f3ef-415d-b25c-3aaf763c9507
> for snapshot snap-jfktn
> 2023-11-29 08:52:24,698 INFO
> [OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
> Created checkpoint in rocksDB at
> /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-c3ba17ef-d947-454e-9c4f-b9063ae65650
> in 15 milliseconds
> 2023-11-29 08:52:24,715 INFO
> [OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointUtils:
> Waited for 16 milliseconds for checkpoint directory
> /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-c3ba17ef-d947-454e-9c4f-b9063ae65650
> availability.
> 2023-11-29 08:52:24,717 WARN
> [OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.OmSnapshotManager:
> Took 614733 ns to find endKey. Caller is
> deleteKeysFromDelKeyTableInSnapshotScope
> 2023-11-29 08:52:24,718 INFO
> [OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.OmSnapshotManager:
> Created checkpoint :
> /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-c3ba17ef-d947-454e-9c4f-b9063ae65650
> for snapshot snap-ay36z
> 2023-11-29 08:52:24,745 INFO
> [OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager:
> Created checkpoint in rocksDB at
> /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-bf0c6141-4185-4361-b15f-c4aa71c5c6d8
> in 12 milliseconds
> 2023-11-29 08:52:24,746 INFO
> [OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointUtils:
> Waited for 0 milliseconds for checkpoint directory
> /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-bf0c6141-4185-4361-b15f-c4aa71c5c6d8
> availability.
> 2023-11-29 08:52:24,747 INFO
> [OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.OmSnapshotManager:
> Created checkpoint :
> /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-bf0c6141-4185-4361-b15f-c4aa71c5c6d8
> for snapshot snap-ay36z
> ...
> {code}
> Also looked if double buffer thread was terminated or paused but no log
> exists for that as well. I looked at the logs for the whole hour between last
> double buffer flush and check-pointing was not created. Couldn’t find any
> issue in that as well.
> On follower nodes, double buffer were working properly.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]