[
https://issues.apache.org/jira/browse/HDDS-12210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931336#comment-17931336
]
Hemant Kumar commented on HDDS-12210:
-------------------------------------
This looks like the same issue as HDDS-12385. I looked at two examples, the
one from the description and one other. In both, the snapshot was purged
successfully in `OMSnapshotPurgeRequest`, but when SnapshotDeletingService (SDS)
tried to open it again while the DoubleBuffer was being flushed, there was a
failure.
A snapshot is purged successfully in validateAndUpdateCache of
OMSnapshotPurgeRequest:
{code:java}
2025-02-04 12:09:55,573 INFO [OM StateMachine ApplyTransaction Thread -
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotPurgeRequest:
Successfully executed snapshotPurgeRequest: {snapshotDBKeys:
"/ota/ozdls3_ota_va14/cm-tmp-15d62c55-9463-4d5f-b724-3817cb623dae"
} along with updating
snapshots:{/ota/ozdls3_ota_va20/cm-tmp-a03e2fc4-fdeb-4cab-aad7-9723b04ae12d=SnapshotInfo{snapshotId:
'22028d7f-a1c4-4882-9787-f72dd27f335d', name:
'cm-tmp-a03e2fc4-fdeb-4cab-aad7-9723b04ae12d', volumeName: 'ota', bucketName:
'ozdls3_ota_va20', snapshotStatus: 'SNAPSHOT_DELETED', creationTime:
'1737272460221', deletionTime: '1737275231780', pathPreviousSnapshotId:
'85137feb-4dff-440e-baf9-c3bdcadfb798', globalPreviousSnapshotId:
'18f25f59-346b-4e15-b9eb-ba56878579b9', snapshotPath: 'ota/ozdls3_ota_va20',
checkpointDir: '-22028d7f-a1c4-4882-9787-f72dd27f335d', dbTxSequenceNumber:
'5468255349', deepClean: 'true', sstFiltered: 'false'},
/ota/ozdls3_ota_va14/cm-tmp-22dd7822-4db2-46c6-a122-c1e1a0e96993=SnapshotInfo{snapshotId:
'dc5984d5-6f4f-4dee-bb49-e80391b2daa1', name:
'cm-tmp-22dd7822-4db2-46c6-a122-c1e1a0e96993', volumeName: 'ota', bucketName:
'ozdls3_ota_va14', snapshotStatus: 'SNAPSHOT_DELETED', creationTime:
'1737298757547', deletionTime: '1737302311966', pathPreviousSnapshotId:
'8cb8fa9b-0176-4a59-b9cc-ca4cb2838e3b', globalPreviousSnapshotId:
'fd23933c-8c35-41f3-a61f-bf2bac9a6087', snapshotPath: 'ota/ozdls3_ota_va14',
checkpointDir: '-dc5984d5-6f4f-4dee-bb49-e80391b2daa1', dbTxSequenceNumber:
'5479427839', deepClean: 'true', sstFiltered: 'false'},
/ota/ozdls3_ota_va14/cm-1546385766-1737773882172-11=SnapshotInfo{snapshotId:
'62bf6f10-efeb-4112-a359-cf02fb98afd2', name: 'cm-1546385766-1737773882172-11',
volumeName: 'ota', bucketName: 'ozdls3_ota_va14', snapshotStatus:
'SNAPSHOT_ACTIVE', creationTime: '1738344122546', deletionTime: '-1',
pathPreviousSnapshotId: 'a2f77d97-3931-4f39-8afb-f4c354765e0e',
globalPreviousSnapshotId: '6184ea57-00df-469b-86bb-e9d71ab2e384', snapshotPath:
'ota/ozdls3_ota_va14', checkpointDir: '-62bf6f10-efeb-4112-a359-cf02fb98afd2',
dbTxSequenceNumber: '5804642318', deepClean: 'true', sstFiltered: 'false'},
/ota/ozdls3_ota_va14/cm-tmp-15d62c55-9463-4d5f-b724-3817cb623dae=SnapshotInfo{snapshotId:
'60e7673b-6a97-4960-a522-b63cb113d016', name:
'cm-tmp-15d62c55-9463-4d5f-b724-3817cb623dae', volumeName: 'ota', bucketName:
'ozdls3_ota_va14', snapshotStatus: 'SNAPSHOT_DELETED', creationTime:
'1737255564931', deletionTime: '1737259016079', pathPreviousSnapshotId:
'8cb8fa9b-0176-4a59-b9cc-ca4cb2838e3b', globalPreviousSnapshotId:
'18f25f59-346b-4e15-b9eb-ba56878579b9', snapshotPath: 'ota/ozdls3_ota_va14',
checkpointDir: '-60e7673b-6a97-4960-a522-b63cb113d016', dbTxSequenceNumber:
'5466629436', deepClean: 'true', sstFiltered: 'false'}}.{code}
Later on, SnapshotDeletingService tried to open the same snapshot and failed:
{code:java}
2025-02-04 12:09:55,634 ERROR
[SnapshotDeletingService#0]-org.apache.hadoop.ozone.om.OmSnapshotManager:
Failed to retrieve snapshot:
/ota/ozdls3_ota_va14/cm-tmp-15d62c55-9463-4d5f-b724-3817cb623dae
java.io.IOException: Failed init RocksDB, db path :
/data/meta01/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-60e7673b-6a97-4960-a522-b63cb113d016,
exception :org.rocksdb.RocksDBException Corruption: IO error: No such file or
directory: While open a file for random read:
/data/meta01/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-60e7673b-6a97-4960-a522-b63cb113d016/719149.ldb:
No such file or directory in file
/data/meta01/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-60e7673b-6a97-4960-a522-b63cb113d016/MANIFEST-720881
at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:180)
at
org.apache.hadoop.hdds.utils.db.DBStoreBuilder.build(DBStoreBuilder.java:220)
at
org.apache.hadoop.ozone.om.OmMetadataManagerImpl.loadDB(OmMetadataManagerImpl.java:589)
at
org.apache.hadoop.ozone.om.OmMetadataManagerImpl.<init>(OmMetadataManagerImpl.java:402)
at
org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:360)
at
org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:1)
at
org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$1(SnapshotCache.java:147)
at
java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
at
org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:143)
at
org.apache.hadoop.ozone.om.OmSnapshotManager.checkForSnapshot(OmSnapshotManager.java:616)
at
org.apache.hadoop.ozone.om.service.SnapshotDeletingService$SnapshotDeletingTask.call(SnapshotDeletingService.java:169)
at
org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
at
java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750){code}
The corresponding DoubleBuffer log:
{code:java}
2025-02-04 12:09:55,921 ERROR
[OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.response.snapshot.OMSnapshotPurgeResponse:
Failed to delete snapshot directory
/data/meta01/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-60e7673b-6a97-4960-a522-b63cb113d016
for snapshot /ota/ozdls3_ota_va14/cm-tmp-15d62c55-9463-4d5f-b724-3817cb623dae
java.nio.file.DirectoryNotEmptyException:
/data/meta01/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-60e7673b-6a97-4960-a522-b63cb113d016
at
sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:242)
at
sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
at java.nio.file.Files.delete(Files.java:1126)
at org.apache.commons.io.FileUtils.delete(FileUtils.java:1175)
at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1194)
at
org.apache.hadoop.ozone.om.response.snapshot.OMSnapshotPurgeResponse.deleteCheckpointDirectory(OMSnapshotPurgeResponse.java:130)
at
org.apache.hadoop.ozone.om.response.snapshot.OMSnapshotPurgeResponse.addToDBBatch(OMSnapshotPurgeResponse.java:100)
at
org.apache.hadoop.ozone.om.response.OMClientResponse.checkAndUpdateDB(OMClientResponse.java:73)
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$5(OzoneManagerDoubleBuffer.java:382)
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.addToBatchWithTrace(OzoneManagerDoubleBuffer.java:220)
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.addToBatch(OzoneManagerDoubleBuffer.java:381)
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushBatch(OzoneManagerDoubleBuffer.java:324)
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushCurrentBuffer(OzoneManagerDoubleBuffer.java:297)
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:262)
at java.lang.Thread.run(Thread.java:750){code}
*Another example:*
A snapshot is purged successfully in validateAndUpdateCache of OMSnapshotPurgeRequest:
{code:java}
2025-02-04 12:55:32,375 INFO [OM StateMachine ApplyTransaction Thread -
0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotPurgeRequest:
Successfully executed snapshotPurgeRequest: {snapshotDBKeys:
"/ota/ozdls3_ota_va14/cm-tmp-2b51a42e-1ece-4b27-b7f8-4bb7ad550281"
} along with updating
snapshots:{/ota/ozdls3_ota_va8/cm-1546385728-1735953302412-1=SnapshotInfo{snapshotId:
'5c847ff0-8f6b-404b-9df9-d9092361cac7', name: 'cm-1546385728-1735953302412-1',
volumeName: 'ota', bucketName: 'ozdls3_ota_va8', snapshotStatus:
'SNAPSHOT_DELETED', creationTime: '1736011578605', deletionTime:
'1736229626552', pathPreviousSnapshotId:
'b7d61a28-3e57-427f-92fe-f5f4b7258621', globalPreviousSnapshotId:
'b7d61a28-3e57-427f-92fe-f5f4b7258621', snapshotPath: 'ota/ozdls3_ota_va8',
checkpointDir: '-5c847ff0-8f6b-404b-9df9-d9092361cac7', dbTxSequenceNumber:
'5029594305', deepClean: 'true', sstFiltered: 'false'},
/ota/ozdls3_ota_va14/cm-tmp-2b51a42e-1ece-4b27-b7f8-4bb7ad550281=SnapshotInfo{snapshotId:
'6e45f5f1-51b6-4321-8f91-110326aa584d', name:
'cm-tmp-2b51a42e-1ece-4b27-b7f8-4bb7ad550281', volumeName: 'ota', bucketName:
'ozdls3_ota_va14', snapshotStatus: 'SNAPSHOT_DELETED', creationTime:
'1736002766663', deletionTime: '1736002778619', pathPreviousSnapshotId:
'd7ddb410-d32f-4b43-b2cc-8ac3509ad2de', globalPreviousSnapshotId:
'b7d61a28-3e57-427f-92fe-f5f4b7258621', snapshotPath: 'ota/ozdls3_ota_va14',
checkpointDir: '-6e45f5f1-51b6-4321-8f91-110326aa584d', dbTxSequenceNumber:
'5029222483', deepClean: 'true', sstFiltered: 'false'},
/ota/ozdls3_ota_va14/cm-1546385766-1737773882172-11=SnapshotInfo{snapshotId:
'62bf6f10-efeb-4112-a359-cf02fb98afd2', name: 'cm-1546385766-1737773882172-11',
volumeName: 'ota', bucketName: 'ozdls3_ota_va14', snapshotStatus:
'SNAPSHOT_ACTIVE', creationTime: '1738344122546', deletionTime: '-1',
pathPreviousSnapshotId: 'a2f77d97-3931-4f39-8afb-f4c354765e0e',
globalPreviousSnapshotId: '6184ea57-00df-469b-86bb-e9d71ab2e384', snapshotPath:
'ota/ozdls3_ota_va14', checkpointDir: '-62bf6f10-efeb-4112-a359-cf02fb98afd2',
dbTxSequenceNumber: '5804642318', deepClean: 'true', sstFiltered: 'false'},
/ota/ozdls3_ota_va14/cm-tmp-7aff84ce-4530-4011-ba4c-8be8a192a437=SnapshotInfo{snapshotId:
'e90f5cbf-4a4c-4111-b329-663f49c6bf1e', name:
'cm-tmp-7aff84ce-4530-4011-ba4c-8be8a192a437', volumeName: 'ota', bucketName:
'ozdls3_ota_va14', snapshotStatus: 'SNAPSHOT_DELETED', creationTime:
'1736045969833', deletionTime: '1736045981689', pathPreviousSnapshotId:
'd7ddb410-d32f-4b43-b2cc-8ac3509ad2de', globalPreviousSnapshotId:
'1fe28717-9fa9-462b-9bcb-1bda77e865b3', snapshotPath: 'ota/ozdls3_ota_va14',
checkpointDir: '-e90f5cbf-4a4c-4111-b329-663f49c6bf1e', dbTxSequenceNumber:
'5035116883', deepClean: 'true', sstFiltered: 'false'}}. {code}
SnapshotDeletingService log:
{code:java}
2025-02-04 12:55:32,417 ERROR
[SnapshotDeletingService#0]-org.apache.hadoop.ozone.om.OmSnapshotManager:
Failed to retrieve snapshot:
/ota/ozdls3_ota_va14/cm-tmp-2b51a42e-1ece-4b27-b7f8-4bb7ad550281
java.io.IOException: Failed init RocksDB, db path :
/data/meta01/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-6e45f5f1-51b6-4321-8f91-110326aa584d,
exception :org.rocksdb.RocksDBException Corruption: IO error: No such file or
directory: While open a file for random read:
/data/meta01/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-6e45f5f1-51b6-4321-8f91-110326aa584d/685604.ldb:
No such file or directory in file
/data/meta01/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-6e45f5f1-51b6-4321-8f91-110326aa584d/MANIFEST-688430
at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:180)
at
org.apache.hadoop.hdds.utils.db.DBStoreBuilder.build(DBStoreBuilder.java:220)
at
org.apache.hadoop.ozone.om.OmMetadataManagerImpl.loadDB(OmMetadataManagerImpl.java:589)
at
org.apache.hadoop.ozone.om.OmMetadataManagerImpl.<init>(OmMetadataManagerImpl.java:402)
at
org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:360)
at
org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:1)
at
org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$1(SnapshotCache.java:147)
at
java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1892)
at
org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:143)
at
org.apache.hadoop.ozone.om.OmSnapshotManager.checkForSnapshot(OmSnapshotManager.java:616)
at
org.apache.hadoop.ozone.om.service.SnapshotDeletingService$SnapshotDeletingTask.call(SnapshotDeletingService.java:169)
at
org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
at
java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750) {code}
DoubleBuffer log:
{code:java}
2025-02-04 12:55:32,666 ERROR
[OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.response.snapshot.OMSnapshotPurgeResponse:
Failed to delete snapshot directory
/data/meta01/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-6e45f5f1-51b6-4321-8f91-110326aa584d
for snapshot /ota/ozdls3_ota_va14/cm-tmp-2b51a42e-1ece-4b27-b7f8-4bb7ad550281
java.nio.file.DirectoryNotEmptyException:
/data/meta01/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-6e45f5f1-51b6-4321-8f91-110326aa584d
at
sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:242)
at
sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
at java.nio.file.Files.delete(Files.java:1126)
at org.apache.commons.io.FileUtils.delete(FileUtils.java:1175)
at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1194)
at
org.apache.hadoop.ozone.om.response.snapshot.OMSnapshotPurgeResponse.deleteCheckpointDirectory(OMSnapshotPurgeResponse.java:130)
at
org.apache.hadoop.ozone.om.response.snapshot.OMSnapshotPurgeResponse.addToDBBatch(OMSnapshotPurgeResponse.java:100)
at
org.apache.hadoop.ozone.om.response.OMClientResponse.checkAndUpdateDB(OMClientResponse.java:73)
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$5(OzoneManagerDoubleBuffer.java:382)
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.addToBatchWithTrace(OzoneManagerDoubleBuffer.java:220){code}
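In both traces the DirectoryNotEmptyException comes from the final
Files.delete on the directory itself, under FileUtils.deleteDirectory. One
possible reading, which is an assumption and not confirmed by the logs, is that
something creates a file in the directory after its contents have been removed
but before the directory is deleted. The sketch below only illustrates that
interleaving on a temp directory. PurgeRaceSketch, newFakeCheckpointDir, and the
file names are hypothetical and are not Ozone code.
{code:java}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class PurgeRaceSketch {

  // Phase 1 of a recursive delete: remove the directory's current children.
  // Assumes a flat directory of regular files, which is enough for this sketch.
  static void cleanDirectory(Path dir) throws IOException {
    List<Path> children = new ArrayList<>();
    try (Stream<Path> s = Files.list(dir)) {
      s.forEach(children::add);
    }
    for (Path child : children) {
      Files.delete(child);
    }
  }

  // Cleans dir, runs the interleaved action (standing in for a concurrent
  // opener), then attempts phase 2: deleting the directory itself.
  // Returns true if the directory was removed, false if it was not empty.
  static boolean deleteWithInterleavedAction(Path dir, Runnable interleaved)
      throws IOException {
    cleanDirectory(dir);
    interleaved.run();
    try {
      Files.delete(dir);
      return true;
    } catch (DirectoryNotEmptyException e) {
      return false;
    }
  }

  // Hypothetical helper: a temp directory holding one placeholder file.
  static Path newFakeCheckpointDir() throws IOException {
    Path dir = Files.createTempDirectory("om.db-sketch");
    Files.createFile(dir.resolve("000001.sst"));
    return dir;
  }

  public static void main(String[] args) throws IOException {
    // No interference: the directory is removed.
    Path quiet = newFakeCheckpointDir();
    System.out.println(deleteWithInterleavedAction(quiet, () -> { }));

    // Interference: a file appears after the clean phase, so the final
    // delete fails and the directory lingers on disk.
    Path raced = newFakeCheckpointDir();
    boolean deleted = deleteWithInterleavedAction(raced, () -> {
      try {
        Files.createFile(raced.resolve("LOCK"));
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    });
    System.out.println(deleted);
    System.out.println(Files.exists(raced));
  }
}
{code}
The interleaved action is run on the same thread to keep the sketch
deterministic. In the OM the two sides would be separate threads
(SnapshotDeletingService and OMDoubleBufferFlushThread in the logs above).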
> Tarball creation interfering with snapshot purge
> -------------------------------------------------
>
> Key: HDDS-12210
> URL: https://issues.apache.org/jira/browse/HDDS-12210
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Hemant Kumar
> Assignee: Hemant Kumar
> Priority: Major
>
> If tarball creation is in progress while a snapshot is being purged, the
> snapshot DB dir delete command fails. Because of that, the snapshot DB dir
> lingers around even though the snapshot is purged from the snapshotInfoTable,
> and manual intervention is needed to delete the dir.
> {code}
> 2025-02-04 12:09:55,921 ERROR
> [OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.response.snapshot.OMSnapshotPurgeResponse:
> Failed to delete snapshot directory
> /data/meta01/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-60e7673b-6a97-4960-a522-b63cb113d016
> for snapshot /ota/ozdls3_ota_va14/cm-tmp-15d62c55-9463-4d5f-b724-3817cb623dae
> java.nio.file.DirectoryNotEmptyException:
> /data/meta01/hadoop-ozone/om/data/db.snapshots/checkpointState/om.db-60e7673b-6a97-4960-a522-b63cb113d016
> at
> sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:242)
> at
> sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
> at java.nio.file.Files.delete(Files.java:1126)
> at org.apache.commons.io.FileUtils.delete(FileUtils.java:1175)
> at
> org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1194)
> at
> org.apache.hadoop.ozone.om.response.snapshot.OMSnapshotPurgeResponse.deleteCheckpointDirectory(OMSnapshotPurgeResponse.java:130)
> at
> org.apache.hadoop.ozone.om.response.snapshot.OMSnapshotPurgeResponse.addToDBBatch(OMSnapshotPurgeResponse.java:100)
> at
> org.apache.hadoop.ozone.om.response.OMClientResponse.checkAndUpdateDB(OMClientResponse.java:73)
> at
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$5(OzoneManagerDoubleBuffer.java:382)
> at
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.addToBatchWithTrace(OzoneManagerDoubleBuffer.java:220)
> at
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.addToBatch(OzoneManagerDoubleBuffer.java:381)
> at
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushBatch(OzoneManagerDoubleBuffer.java:324)
> at
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushCurrentBuffer(OzoneManagerDoubleBuffer.java:297)
> at
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:262)
> at java.lang.Thread.run(Thread.java:750)
> {code}
> This task is to come up with an approach to get rid of manual intervention.