[ 
https://issues.apache.org/jira/browse/HDDS-9965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDDS-9965.
-----------------------------------
    Resolution: Duplicate

> [Snapshot] SnapshotCache is in inconsistent state
> -------------------------------------------------
>
>                 Key: HDDS-9965
>                 URL: https://issues.apache.org/jira/browse/HDDS-9965
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Snapshot
>            Reporter: Hemant Kumar
>            Assignee: Aswin Shakil
>            Priority: Blocker
>              Labels: ozone-snapshot
>
> Seems like fix for deadlock (in HDDS-9871) caused cache inconsistency. Due to 
> which *SnapDiff* and background services *KeyDeletingService* and 
> *SstFilteringService* are failing.
> *KeyDeletingService* logs:
> {code:java}
> 2023-12-18 19:19:29,563 WARN 
> [KeyDeletingService#0]-org.apache.hadoop.hdds.utils.BackgroundService: 
> Background task execution failed
> java.lang.IllegalStateException: Cache map entry removal failure. The cache 
> is in an inconsistent state. Expected OmSnapshot instance: 
> org.apache.hadoop.ozone.om.snapshot.ReferenceCounted@4f63f85e, actual: null
>         at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:512)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanupInternal(SnapshotCache.java:313)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanup(SnapshotCache.java:285)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:220)
>         at 
> org.apache.hadoop.ozone.om.OmSnapshotManager.checkForSnapshot(OmSnapshotManager.java:625)
>         at 
> org.apache.hadoop.ozone.om.OmMetadataManagerImpl.getLatestActiveSnapshot(OmMetadataManagerImpl.java:1691)
>         at 
> org.apache.hadoop.ozone.om.OmMetadataManagerImpl.getPendingDeletionKeys(OmMetadataManagerImpl.java:1544)
>         at 
> org.apache.hadoop.ozone.om.KeyManagerImpl.getPendingDeletionKeys(KeyManagerImpl.java:607)
>         at 
> org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:183)
>         at 
> org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
>         at 
> java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
>         at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834) {code}
> {code:java}
> 2023-12-18 19:20:28,768 WARN 
> [KeyDeletingService#0]-org.apache.hadoop.hdds.utils.BackgroundService: 
> Background task execution failed
> java.lang.IllegalStateException: Cache map entry removal failure. The cache 
> is in an inconsistent state. Expected OmSnapshot instance: 
> org.apache.hadoop.ozone.om.snapshot.ReferenceCounted@4f63f85e, actual: 
> org.apache.hadoop.ozone.om.snapshot.ReferenceCounted@7656056
>         at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:512)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanupInternal(SnapshotCache.java:313)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanup(SnapshotCache.java:285)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:220)
>         at 
> org.apache.hadoop.ozone.om.OmSnapshotManager.checkForSnapshot(OmSnapshotManager.java:625)
>         at 
> org.apache.hadoop.ozone.om.OmMetadataManagerImpl.getLatestActiveSnapshot(OmMetadataManagerImpl.java:1691)
>         at 
> org.apache.hadoop.ozone.om.OmMetadataManagerImpl.getPendingDeletionKeys(OmMetadataManagerImpl.java:1544)
>         at 
> org.apache.hadoop.ozone.om.KeyManagerImpl.getPendingDeletionKeys(KeyManagerImpl.java:607)
>         at 
> org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:183)
>         at 
> org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
>         at 
> java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
>         at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834)
> 2023-12-18 19:20:28,768 WARN 
> [SstFilteringService#0]-org.apache.hadoop.hdds.utils.BackgroundService: 
> Background task execution failed
> java.lang.IllegalStateException: Cache map entry removal failure. The cache 
> is in an inconsistent state. Expected OmSnapshot instance: 
> org.apache.hadoop.ozone.om.snapshot.ReferenceCounted@4f63f85e, actual: null
>         at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:512)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanupInternal(SnapshotCache.java:313)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanup(SnapshotCache.java:285)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:220)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:153)
>         at 
> org.apache.hadoop.ozone.om.SstFilteringService$SstFilteringTask.call(SstFilteringService.java:185)
>         at 
> org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
>         at 
> java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
>         at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834)
> {code}
> *SSTFilteringService* logs:
> {code:java}
> 2023-12-18 19:21:28,742 ERROR 
> [SstFilteringService#0]-org.apache.hadoop.ozone.om.OmSnapshotManager: Failed 
> to retrieve snapshot: /vol-t2gj8/buck-07uux/snap-5griw
> java.io.IOException: Failed init RocksDB, db path : 
> /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b,
>  exception :org.rocksdb.RocksDBException lock hold by current process, 
> acquire time 1702927228 acquiring thread 139777345017600: 
> /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK:
>  No locks available
>         at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:180)
>         at 
> org.apache.hadoop.hdds.utils.db.DBStoreBuilder.build(DBStoreBuilder.java:220)
>         at 
> org.apache.hadoop.ozone.om.OmMetadataManagerImpl.loadDB(OmMetadataManagerImpl.java:598)
>         at 
> org.apache.hadoop.ozone.om.OmMetadataManagerImpl.<init>(OmMetadataManagerImpl.java:406)
>         at 
> org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:357)
>         at 
> org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:1)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$0(SnapshotCache.java:171)
>         at 
> java.base/java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1908)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:167)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:153)
>         at 
> org.apache.hadoop.ozone.om.SstFilteringService$SstFilteringTask.call(SstFilteringService.java:185)
>         at 
> org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
>         at 
> java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
>         at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.io.IOException: class 
> org.apache.hadoop.hdds.utils.db.RocksDatabase: Failed to open 
> /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b;
>  status : IOError; message : lock hold by current process, acquire time 
> 1702927228 a
> cquiring thread 139777345017600: 
> /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK:
>  No locks available
>         at 
> org.apache.hadoop.hdds.utils.HddsServerUtil.toIOException(HddsServerUtil.java:667)
>         at 
> org.apache.hadoop.hdds.utils.db.RocksDatabase.toIOException(RocksDatabase.java:91)
>         at 
> org.apache.hadoop.hdds.utils.db.RocksDatabase.open(RocksDatabase.java:168)
>         at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:113)
>         ... 18 more
> Caused by: org.rocksdb.RocksDBException: lock hold by current process, 
> acquire time 1702927228 acquiring thread 139777345017600: 
> /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK:
>  No locks available
>         at org.rocksdb.RocksDB.open(Native Method)
>         at org.rocksdb.RocksDB.open(RocksDB.java:307)
>         at 
> org.apache.hadoop.hdds.utils.db.managed.ManagedRocksDB.open(ManagedRocksDB.java:75)
>         at 
> org.apache.hadoop.hdds.utils.db.RocksDatabase.open(RocksDatabase.java:154)
>         ... 19 more
> 2023-12-18 19:21:28,743 WARN 
> [SstFilteringService#0]-org.apache.hadoop.hdds.utils.BackgroundService: 
> Background task execution failed
> java.lang.IllegalStateException: java.io.IOException: Failed init RocksDB, db 
> path : 
> /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b,
>  exception :org.rocksdb.RocksDBException lock hold by current process, 
> acquire time 1702927228 acqu
> iring thread 139777345017600: 
> /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK:
>  No locks available
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$0(SnapshotCache.java:179)
>         at 
> java.base/java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1908)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:167)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:153)
>         at 
> org.apache.hadoop.ozone.om.SstFilteringService$SstFilteringTask.call(SstFilteringService.java:185)
>         at 
> org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
>         at 
> java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
>         at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.io.IOException: Failed init RocksDB, db path : 
> /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b,
>  exception :org.rocksdb.RocksDBException lock hold by current process, 
> acquire time 1702927228 acquiring thread 139777345
> 017600: 
> /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK:
>  No locks available
>         at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:180)
>         at 
> org.apache.hadoop.hdds.utils.db.DBStoreBuilder.build(DBStoreBuilder.java:220)
>         at 
> org.apache.hadoop.ozone.om.OmMetadataManagerImpl.loadDB(OmMetadataManagerImpl.java:598)
>         at 
> org.apache.hadoop.ozone.om.OmMetadataManagerImpl.<init>(OmMetadataManagerImpl.java:406)
>         at 
> org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:357)
>         at 
> org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:1)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$0(SnapshotCache.java:171)
>         ... 12 more
> Caused by: java.io.IOException: class 
> org.apache.hadoop.hdds.utils.db.RocksDatabase: Failed to open 
> /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b;
>  status : IOError; message : lock hold by current process, acquire time 
> 1702927228 a
> cquiring thread 139777345017600: 
> /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK:
>  No locks available
>         at 
> org.apache.hadoop.hdds.utils.HddsServerUtil.toIOException(HddsServerUtil.java:667)
>         at 
> org.apache.hadoop.hdds.utils.db.RocksDatabase.toIOException(RocksDatabase.java:91)
>         at 
> org.apache.hadoop.hdds.utils.db.RocksDatabase.open(RocksDatabase.java:168)
>         at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:113)
>         ... 18 more
> Caused by: org.rocksdb.RocksDBException: lock hold by current process, 
> acquire time 1702927228 acquiring thread 139777345017600: 
> /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK:
>  No locks available
>         at org.rocksdb.RocksDB.open(Native Method)
>         at org.rocksdb.RocksDB.open(RocksDB.java:307)
>         at 
> org.apache.hadoop.hdds.utils.db.managed.ManagedRocksDB.open(ManagedRocksDB.java:75)
>         at 
> org.apache.hadoop.hdds.utils.db.RocksDatabase.open(RocksDatabase.java:154)
>         ... 19 more{code}
> {code:java}
> 2023-12-18 19:19:28,741 ERROR 
> [SstFilteringService#0]-org.apache.hadoop.ozone.om.SstFilteringService: 
> Exception encountered while filtering a snapshot
> java.io.IOException: Rocks Database is closed
>         at 
> org.apache.hadoop.hdds.utils.db.RocksDatabase.assertClose(RocksDatabase.java:445)
>         at 
> org.apache.hadoop.hdds.utils.db.RocksDatabase.deleteFilesNotMatchingPrefix(RocksDatabase.java:958)
>         at 
> org.apache.hadoop.ozone.om.SstFilteringService$SstFilteringTask.call(SstFilteringService.java:193)
>         at 
> org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
>         at 
> java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
>         at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834)
> {code}
> *SnapDiff* logs:
> {code:java}
> 2023-12-18 19:33:12,222 ERROR 
> [snapshot-diff-job-thread-id-8]-org.apache.hadoop.ozone.om.snapshot.SnapshotDiffManager:
>  Caught unchecked exception during diff report generation for volume: 
> voljngnm bucket: bucketjngnm, fromSnapshot: snap-t1gfn and toSnapshot: 
> snap-h1xmg
> java.lang.IllegalStateException: Cache map entry removal failure. The cache 
> is in an inconsistent state. Expected OmSnapshot instance: 
> org.apache.hadoop.ozone.om.snapshot.ReferenceCounted@4f63f85e, actual: null
>         at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:512)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanupInternal(SnapshotCache.java:313)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanup(SnapshotCache.java:285)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:220)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:153)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotDiffManager.generateSnapshotDiffReport(SnapshotDiffManager.java:841)
>         at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotDiffManager.lambda$2(SnapshotDiffManager.java:722)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834)
> {code}
> I suspect it might be a race condition between *cleanupInternal()* and 
> *invalidate()*. It wasn't a problem earlier because they were synchronize by 
> ConcurrentHashMap's lock. But it could be something else.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to