[ 
https://issues.apache.org/jira/browse/HDDS-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uma Maheswara Rao G resolved HDDS-9126.
---------------------------------------
    Resolution: Fixed

Resolving this as the corresponding PR is merged.

> [ozone-snapshot] Unordered deletion of snapshots corrupting OM
> --------------------------------------------------------------
>
>                 Key: HDDS-9126
>                 URL: https://issues.apache.org/jira/browse/HDDS-9126
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Manager
>            Reporter: Soumitra Sulav
>            Assignee: Sadanand Shenoy
>            Priority: Critical
>              Labels: ozone-snapshot, pull-request-available
>             Fix For: 1.4.0
>
>         Attachments: console.log, ozone-om-quasar-csvjze-1.log, 
> ozone-om-quasar-csvjze-2.log, ozone-om-quasar-csvjze-3.log, 
> ozone-scm-quasar-csvjze-1.log, ozone-scm-quasar-csvjze-2.log, 
> ozone-scm-quasar-csvjze-3.log
>
>
> Test scenario:
> The test test_unordered_deletion deletes snapshots in random order. While 
> doing so, it hits the exception below in the OM more often than not.
> Once the error is seen, the OM goes into an unhealthy state, and none of 
> the tests after it can run.
> The snapshot is deleted:
> {code:java}
> 2023-08-06 06:33:27,113 INFO [OM StateMachine ApplyTransaction Thread - 
> 0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotDeleteRequest: 
> Deleted snapshot 'snap-ae5or' under path 'vol-w19gk/buck-f9sqw'
> {code}
> Soon after, during a copy:
> {code:java}
> 2023-08-06 06:39:06,314|INFO|MainThread|machine.py:188 - 
> run()||GUID=5210f279-e5c7-4ee9-b652-b49a6b0eb07a|RUNNING: 
> /opt/cloudera/parcels/CDH/bin/ozone fs -cp 
> ofs://ozone1/vol-w19gk/buck-f9sqw/.snapshot/snap-5qmtv/key_1691303390 
> ofs://ozone1/vol-w19gk/buck-f9sqw/
> {code}
> OM log stack trace:
> {code:java}
> 2023-08-06 06:33:38,126 INFO 
> [SstFilteringService#0]-org.apache.hadoop.hdds.utils.db.RocksDatabase: 
> Deleting sst file /000396.sst corresponding to column family keyTable from 
> db: 
> /var/lib/hadoop-ozone/om/data293349/db.snapshots/checkpointState/om.db-0ccb08e9-c5ab-45bb-a71e-8444a2142511
> 2023-08-06 06:33:38,127 INFO 
> [SstFilteringService#0]-org.apache.hadoop.hdds.utils.db.managed.ManagedRocksObjectUtils:
>  Waited for 1 milliseconds for file 
> /var/lib/hadoop-ozone/om/data293349/db.snapshots/checkpointState/om.db-0ccb08e9-c5ab-45bb-a71e-8444a2142511/000396.sst
>  deletion.
> 2023-08-06 06:34:37,938 INFO 
> [SstFilteringService#0]-org.apache.hadoop.ozone.om.snapshot.SnapshotCache: 
> Loading snapshot. Table key: /vol-w19gk/buck-f9sqw/snap-ae5or
> 2023-08-06 06:34:37,938 INFO 
> [SstFilteringService#0]-org.apache.hadoop.ozone.om.helpers.OmKeyInfo: 
> OmKeyInfo.getCodec ignorePipeline = true
> 2023-08-06 06:34:37,989 ERROR 
> [SstFilteringService#0]-org.apache.hadoop.ozone.om.SstFilteringService: Error 
> during Snapshot sst filtering
> FILE_NOT_FOUND org.apache.hadoop.ozone.om.exceptions.OMException: Unable to 
> load snapshot. Snapshot with table key '/vol-w19gk/buck-f9sqw/snap-ae5or' is 
> no longer active
>     at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:205)
>     at 
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:151)
>     at 
> org.apache.hadoop.ozone.om.SstFilteringService$SstFilteringTask.call(SstFilteringService.java:178)
>     at 
> org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
>     at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 2023-08-06 06:35:30,232 INFO 
> [pool-8-thread-1]-org.apache.ozone.rocksdiff.RocksDBCheckpointDiffer: 
> Removing SST files: [000410, 000453, 000496, 000253, 000374, 000535, 000611, 
> 000456, 000417, 000658, 000338, 000459, 000380, 000185, 000124, 000245, 
> 000443, 000200, 000563, 000364, 000562, 000128, 000447, 000248, 000688, 
> 000324, 000522, 000367, 000209, 000407, 000129, 000602, 000290, 000296, 
> 000692, 000130, 000372, 000690, 000172, 000293, 000157, 000355, 000399, 
> 000674, 000233, 000277, 000310, 000398, 000552, 000596, 000474, 000352, 
> 000550, 000315, 000359, 000634, 000236, 000599, 000554, 000638, 000637, 
> 000559, 000514, 000518, 000160, 000681, 000163, 000284, 000162, 000344, 
> 000663, 000264, 000462, 000425, 000667, 000225, 000302, 000467, 000588, 
> 000301, 000506, 000307, 000504, 000668, 000628, 000193, 000391, 000197] as 
> part of SST file pruning.
> 2023-08-06 06:35:37,937 INFO 
> [SstFilteringService#0]-org.apache.hadoop.ozone.om.snapshot.SnapshotCache: 
> Loading snapshot. Table key: /vol-w19gk/buck-f9sqw/snap-ae5or
> 2023-08-06 06:35:37,937 ERROR 
> [SstFilteringService#0]-org.apache.hadoop.ozone.om.SstFilteringService: Error 
> during Snapshot sst filtering
> FILE_NOT_FOUND org.apache.hadoop.ozone.om.exceptions.OMException: Unable to 
> load snapshot. Snapshot with table key '/vol-w19gk/buck-f9sqw/snap-ae5or' is 
> no longer active 
> {code}
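
For triage, the repeated "no longer active" errors in the OM log above can be counted per snapshot table key. The following is a minimal sketch, not part of the report or of the merged fix: the function name `inactive_snapshot_hits` is hypothetical, and it assumes the log text may be hard-wrapped and prefixed with "> " as in this message.

```python
import re
from collections import Counter

# Matches the OMException message text seen in the OM log above.
INACTIVE_RE = re.compile(
    r"Snapshot with table key '([^']+)' is\s+no longer active"
)


def inactive_snapshot_hits(log_text: str) -> Counter:
    """Count 'no longer active' errors per snapshot table key.

    Hypothetical helper: strips a leading "> " quote prefix from each
    line and collapses whitespace so that messages wrapped across
    lines still match.
    """
    unquoted = re.sub(r"^>\s?", "", log_text, flags=re.MULTILINE)
    flat = re.sub(r"\s+", " ", unquoted)
    return Counter(INACTIVE_RE.findall(flat))


if __name__ == "__main__":
    sample = (
        "> load snapshot. Snapshot with table key "
        "'/vol-w19gk/buck-f9sqw/snap-ae5or' is \n"
        "> no longer active\n"
    )
    for key, count in inactive_snapshot_hits(sample).items():
        print(key, count)
```

A key that keeps reappearing across SstFilteringService runs, as '/vol-w19gk/buck-f9sqw/snap-ae5or' does above, suggests the service is retrying a snapshot that has already been deleted.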



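The random-order deletion described in the test scenario can be approximated as follows. This is a sketch under assumptions: the internals of test_unordered_deletion are not shown in the report, the helper name `unordered_delete_commands` is hypothetical, and it assumes the deletions are issued through the `ozone sh snapshot delete <bucket> <snapshotName>` shell command. The function only builds the command lines and does not run them.

```python
import random


def unordered_delete_commands(bucket_path, snapshot_names, seed=None):
    """Build snapshot delete commands in a shuffled order.

    Hypothetical helper: bucket_path and snapshot_names are supplied by
    the caller; a fixed seed makes the deletion order reproducible.
    """
    order = list(snapshot_names)
    random.Random(seed).shuffle(order)
    # Assumed CLI form: ozone sh snapshot delete <bucket> <snapshotName>
    return [
        ["ozone", "sh", "snapshot", "delete", bucket_path, name]
        for name in order
    ]


if __name__ == "__main__":
    # Snapshot names taken from the log excerpts in this report.
    commands = unordered_delete_commands(
        "/vol-w19gk/buck-f9sqw", ["snap-ae5or", "snap-5qmtv"], seed=1
    )
    for cmd in commands:
        print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run` on a host with the Ozone client installed. Logging the seed alongside the run makes a failing deletion order repeatable.
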
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
