[
https://issues.apache.org/jira/browse/HDDS-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Uma Maheswara Rao G resolved HDDS-9126.
---------------------------------------
Resolution: Fixed
Resolving this as the corresponding PR is merged.
> [ozone-snapshot] Unordered deletion of snapshots corrupting OM
> --------------------------------------------------------------
>
> Key: HDDS-9126
> URL: https://issues.apache.org/jira/browse/HDDS-9126
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Manager
> Reporter: Soumitra Sulav
> Assignee: Sadanand Shenoy
> Priority: Critical
> Labels: ozone-snapshot, pull-request-available
> Fix For: 1.4.0
>
> Attachments: console.log, ozone-om-quasar-csvjze-1.log,
> ozone-om-quasar-csvjze-2.log, ozone-om-quasar-csvjze-3.log,
> ozone-scm-quasar-csvjze-1.log, ozone-scm-quasar-csvjze-2.log,
> ozone-scm-quasar-csvjze-3.log
>
>
> Test scenario:
> The test test_unordered_deletion deletes snapshots in random order. While
> doing so, it hits the OM exception below more often than not.
> Once the error occurs, the OM goes into an unhealthy state and none of the
> subsequent tests can run.
> The snapshot is deleted:
> {code:java}
> 2023-08-06 06:33:27,113 INFO [OM StateMachine ApplyTransaction Thread -
> 0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotDeleteRequest:
> Deleted snapshot 'snap-ae5or' under path 'vol-w19gk/buck-f9sqw'
> {code}
> Soon afterwards, during a copy:
> {code:java}
> 2023-08-06 06:39:06,314|INFO|MainThread|machine.py:188 -
> run()||GUID=5210f279-e5c7-4ee9-b652-b49a6b0eb07a|RUNNING:
> /opt/cloudera/parcels/CDH/bin/ozone fs -cp
> ofs://ozone1/vol-w19gk/buck-f9sqw/.snapshot/snap-5qmtv/key_1691303390
> ofs://ozone1/vol-w19gk/buck-f9sqw/
> {code}
> OM log stacktrace:
> {code:java}
> 2023-08-06 06:33:38,126 INFO
> [SstFilteringService#0]-org.apache.hadoop.hdds.utils.db.RocksDatabase:
> Deleting sst file /000396.sst corresponding to column family keyTable from
> db:
> /var/lib/hadoop-ozone/om/data293349/db.snapshots/checkpointState/om.db-0ccb08e9-c5ab-45bb-a71e-8444a2142511
> 2023-08-06 06:33:38,127 INFO
> [SstFilteringService#0]-org.apache.hadoop.hdds.utils.db.managed.ManagedRocksObjectUtils:
> Waited for 1 milliseconds for file
> /var/lib/hadoop-ozone/om/data293349/db.snapshots/checkpointState/om.db-0ccb08e9-c5ab-45bb-a71e-8444a2142511/000396.sst
> deletion.
> 2023-08-06 06:34:37,938 INFO
> [SstFilteringService#0]-org.apache.hadoop.ozone.om.snapshot.SnapshotCache:
> Loading snapshot. Table key: /vol-w19gk/buck-f9sqw/snap-ae5or
> 2023-08-06 06:34:37,938 INFO
> [SstFilteringService#0]-org.apache.hadoop.ozone.om.helpers.OmKeyInfo:
> OmKeyInfo.getCodec ignorePipeline = true
> 2023-08-06 06:34:37,989 ERROR
> [SstFilteringService#0]-org.apache.hadoop.ozone.om.SstFilteringService: Error
> during Snapshot sst filtering
> FILE_NOT_FOUND org.apache.hadoop.ozone.om.exceptions.OMException: Unable to
> load snapshot. Snapshot with table key '/vol-w19gk/buck-f9sqw/snap-ae5or' is
> no longer active
> at
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:205)
> at
> org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:151)
> at
> org.apache.hadoop.ozone.om.SstFilteringService$SstFilteringTask.call(SstFilteringService.java:178)
> at
> org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
> at
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2023-08-06 06:35:30,232 INFO
> [pool-8-thread-1]-org.apache.ozone.rocksdiff.RocksDBCheckpointDiffer:
> Removing SST files: [000410, 000453, 000496, 000253, 000374, 000535, 000611,
> 000456, 000417, 000658, 000338, 000459, 000380, 000185, 000124, 000245,
> 000443, 000200, 000563, 000364, 000562, 000128, 000447, 000248, 000688,
> 000324, 000522, 000367, 000209, 000407, 000129, 000602, 000290, 000296,
> 000692, 000130, 000372, 000690, 000172, 000293, 000157, 000355, 000399,
> 000674, 000233, 000277, 000310, 000398, 000552, 000596, 000474, 000352,
> 000550, 000315, 000359, 000634, 000236, 000599, 000554, 000638, 000637,
> 000559, 000514, 000518, 000160, 000681, 000163, 000284, 000162, 000344,
> 000663, 000264, 000462, 000425, 000667, 000225, 000302, 000467, 000588,
> 000301, 000506, 000307, 000504, 000668, 000628, 000193, 000391, 000197] as
> part of SST file pruning.
> 2023-08-06 06:35:37,937 INFO
> [SstFilteringService#0]-org.apache.hadoop.ozone.om.snapshot.SnapshotCache:
> Loading snapshot. Table key: /vol-w19gk/buck-f9sqw/snap-ae5or
> 2023-08-06 06:35:37,937 ERROR
> [SstFilteringService#0]-org.apache.hadoop.ozone.om.SstFilteringService: Error
> during Snapshot sst filtering
> FILE_NOT_FOUND org.apache.hadoop.ozone.om.exceptions.OMException: Unable to
> load snapshot. Snapshot with table key '/vol-w19gk/buck-f9sqw/snap-ae5or' is
> no longer active
> {code}
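
For readers trying to reproduce the scenario in the quoted report, the "delete snapshots in random order" step might look roughly like the sketch below. It is not the actual test_unordered_deletion code, which is not included in this message. The helper name build_unordered_delete_commands and the seed parameter are hypothetical, and the `ozone sh snapshot delete <bucket> <snapshot>` command form is an assumption about the Ozone shell. The bucket path and snapshot names are taken from the logs above. The helper only builds the command lines and does not run them.

```python
import random
from typing import List, Optional


def build_unordered_delete_commands(
    bucket_path: str,
    snapshot_names: List[str],
    seed: Optional[int] = None,
) -> List[List[str]]:
    """Return one snapshot-delete command per snapshot, in shuffled order.

    Hypothetical helper: the CLI form used here is an assumption, not taken
    from the report.
    """
    rng = random.Random(seed)        # seeded so a failing order can be replayed
    shuffled = list(snapshot_names)  # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    return [
        ["ozone", "sh", "snapshot", "delete", bucket_path, name]
        for name in shuffled
    ]


if __name__ == "__main__":
    commands = build_unordered_delete_commands(
        "vol-w19gk/buck-f9sqw",
        ["snap-ae5or", "snap-5qmtv"],
        seed=0,
    )
    for command in commands:
        print(" ".join(command))
```

Each returned list can be passed to subprocess.run on a cluster host. Recording the seed makes it possible to replay the exact deletion order that triggered the failure.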