[ https://issues.apache.org/jira/browse/OAK-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Csaba Varga updated OAK-7209:
-----------------------------
    Affects Version/s: 1.10
                       1.8.2

> Race condition can resurrect blobs during blob GC
> -------------------------------------------------
>
>                 Key: OAK-7209
>                 URL: https://issues.apache.org/jira/browse/OAK-7209
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: blob-plugins
>    Affects Versions: 1.6.5, 1.10, 1.8.2
>            Reporter: Csaba Varga
>            Assignee: Amit Jain
>            Priority: Minor
>
> A race condition exists between the scheduled blob ID publishing process and
> the GC process that can resurrect the blobs being deleted by the GC. This is
> how it can happen:
> # MarkSweepGarbageCollector.collectGarbage() starts running.
> # As part of the preparation for sweeping, BlobIdTracker.globalMerge() is
> called, which merges all blob ID records from the blob store into the local
> tracker.
> # Sweeping begins deleting files.
> # BlobIdTracker.snapshot() gets called by the scheduler. It pushes all blob
> ID records that were collected and merged in step 2 back into the blob store,
> then deletes the local copies.
> # Sweeping completes and tries to remove the successfully deleted blobs from
> the tracker. Step 4 already deleted those records from the local files, so
> nothing gets removed.
> The end result is that all blobs removed during the GC run are still
> considered alive, which causes warnings when later GC runs try to remove
> them again.
> The risk grows the longer the sweep runs, but it can also happen during a
> short but badly timed GC run. (We found it during a GC run that took more
> than 11 hours to complete.)
> I can see two ways to approach this:
> # Suspend the execution of BlobIdTracker.snapshot() while blob GC is in
> progress. This requires adding new methods to the BlobTracker interface to
> allow suspending and resuming snapshotting of the tracker.
> # Have the two overloads of BlobIdTracker.remove() do a globalMerge() before
> trying to remove anything. This ensures that even if a snapshot() call
> happened during the GC run, all IDs are "pulled back" into the local tracker
> and can be removed successfully.
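
To make the interleaving easier to follow, here is a minimal single-threaded sketch of the five steps and of the second proposed approach. It is a toy model, not Oak code: SharedRecords, LocalTracker, trackedIds(), run() and the mergeBeforeRemove flag are hypothetical names, and the sketch assumes that globalMerge() moves the shared records into the local tracker, which may not match what BlobIdTracker actually does.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: every class here is a hypothetical stand-in, not Oak's API.
final class BlobTrackerRaceSketch {

    /** Stand-in for the blob ID records kept in the shared blob store. */
    static final class SharedRecords {
        final Set<String> ids = new TreeSet<>();
    }

    /** Stand-in for the local tracker that the GC and the scheduler both touch. */
    static final class LocalTracker {
        private final SharedRecords shared;
        private final Set<String> local = new TreeSet<>();
        private final boolean mergeBeforeRemove;

        LocalTracker(SharedRecords shared, boolean mergeBeforeRemove) {
            this.shared = shared;
            this.mergeBeforeRemove = mergeBeforeRemove;
        }

        // Assumption: merging moves the shared records into the local tracker.
        void globalMerge() {
            local.addAll(shared.ids);
            shared.ids.clear();
        }

        // Pushes the local records back to the shared store, then drops the local copies.
        void snapshot() {
            shared.ids.addAll(local);
            local.clear();
        }

        // Removes deleted IDs from the local records; optionally pulls them back first.
        void remove(Set<String> deleted) {
            if (mergeBeforeRemove) {
                globalMerge();
            }
            local.removeAll(deleted);
        }

        // Every ID the system still considers alive (local plus shared records).
        Set<String> trackedIds() {
            Set<String> all = new TreeSet<>(local);
            all.addAll(shared.ids);
            return all;
        }
    }

    /** Replays steps 2 to 5 from the description in a fixed order. */
    static Set<String> run(boolean mergeBeforeRemove) {
        SharedRecords shared = new SharedRecords();
        shared.ids.addAll(Arrays.asList("blob-1", "blob-2", "blob-3"));
        LocalTracker tracker = new LocalTracker(shared, mergeBeforeRemove);

        tracker.globalMerge();                                    // step 2
        Set<String> deleted =
                new TreeSet<>(Arrays.asList("blob-1", "blob-2")); // step 3: sweep deletes these
        tracker.snapshot();                                       // step 4: scheduler fires mid-sweep
        tracker.remove(deleted);                                  // step 5
        return tracker.trackedIds();
    }

    public static void main(String[] args) {
        System.out.println("remove() as-is:          " + run(false));
        System.out.println("remove() with merge first: " + run(true));
    }
}
```

In this model, run(false) still reports blob-1 and blob-2 as tracked even though the sweep deleted them, while run(true) leaves only blob-3. Because the sketch is single-threaded, it says nothing about a snapshot() call landing between the merge and the removal inside remove(); a real change would probably need to guard against that as well.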