Amit Jain commented on OAK-7209:
Thanks for your patches.
bq. I've also fixed the unit test that was supposed to catch this issue, but
didn't because it unintentionally caused a merge before removal.
Well, the unit test would still be valid for most GC cases, where there is no
intervening snapshot during the GC sweep phase (i.e. between the globalMerge
and the remove). I would add a new test method to cover this case.
bq. The risk is higher the longer the sweep runs, but it can happen during a
short but badly timed GC run as well. (We've found it during a GC run that took
more than 11 hours to complete.)
In trunk/1.8 the snapshot is allowed only if the time since the last snapshot
exceeds the snapshot interval (12 hours by default). But yes, lengthier sweep
cycles are still problematic.
> Race condition can resurrect blobs during blob GC
> Key: OAK-7209
> URL: https://issues.apache.org/jira/browse/OAK-7209
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: blob-plugins
> Affects Versions: 1.6.5, 1.10, 1.8.2
> Reporter: Csaba Varga
> Assignee: Amit Jain
> Priority: Minor
> A race condition exists between the scheduled blob ID publishing process and
> the GC process that can resurrect the blobs being deleted by the GC. This is
> how it can happen:
> # MarkSweepGarbageCollector.collectGarbage() starts running.
> As part of the preparation for sweeping, BlobIdTracker.globalMerge() is
> called, which merges all blob ID records from the blob store into the local
> tracker files.
> # Sweeping begins deleting files.
> # BlobIdTracker.snapshot() gets called by the scheduler. It pushes all blob
> ID records that were collected and merged in step 2 back into the blob store,
> then deletes the local copies.
> # Sweeping completes and tries to remove the successfully deleted blobs from
> the tracker. Step 4 already deleted those records from the local files, so
> nothing gets removed.
> The end result is that all blobs removed during the GC run will be considered
> still alive, causing warnings when later GC runs try to remove them again.
> The risk is higher the longer the sweep runs, but it can happen during a
> short but badly timed GC run as well. (We've found it during a GC run that
> took more than 11 hours to complete.)
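The interleaving in steps 1-5 can be modeled with a minimal in-memory sketch: `localIds` stands in for the tracker's local files and `storeIds` for the records held in the blob store. All names here are illustrative assumptions, not Oak's actual API.

```java
import java.util.HashSet;
import java.util.Set;

// Minimal in-memory model of the race described above. Not Oak's real code:
// localIds models the tracker's local files, storeIds the blob store records.
class TrackerModel {
    final Set<String> localIds = new HashSet<>();
    final Set<String> storeIds = new HashSet<>();

    // Step 2: merge all blob store records into the local tracker.
    void globalMerge() {
        localIds.addAll(storeIds);
        storeIds.clear();
    }

    // Step 4: push local records back to the blob store, delete local copies.
    void snapshot() {
        storeIds.addAll(localIds);
        localIds.clear();
    }

    // Step 5: remove swept blob IDs -- but only from the *local* copy.
    void remove(Set<String> deleted) {
        localIds.removeAll(deleted);
    }
}
```

Running globalMerge(), then snapshot(), then remove() leaves the swept ID sitting in the store-side records: the remove() is a no-op because the snapshot already emptied the local files, which is exactly the resurrection described above.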
> I can see two ways to approach this:
> # Suspend the execution of BlobIdTracker.snapshot() while Blob GC is in
> progress. This requires adding new methods to the BlobTracker interface to
> allow suspending and resuming snapshotting of the tracker.
> # Have the two overloads of BlobIdTracker.remove() do a globalMerge() before
> trying to remove anything. This ensures that even if a snapshot() call
> happened during the GC run, all IDs are "pulled back" into the local tracker
> and can be removed successfully.
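Approach #2 can be sketched by extending the same illustrative in-memory model (localIds for the tracker's local files, storeIds for the blob store records; names are assumptions, not Oak's API): remove() first performs a globalMerge() so that any IDs pushed out by a concurrent snapshot() are pulled back before deletion.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of approach #2: remove() merges blob store records back into the
// local tracker before deleting. Illustrative model only, not Oak's real code.
class FixedTrackerModel {
    final Set<String> localIds = new HashSet<>();
    final Set<String> storeIds = new HashSet<>();

    void globalMerge() {
        localIds.addAll(storeIds);
        storeIds.clear();
    }

    void snapshot() {
        storeIds.addAll(localIds);
        localIds.clear();
    }

    void remove(Set<String> deleted) {
        globalMerge(); // pull back any IDs a concurrent snapshot() pushed out
        localIds.removeAll(deleted);
    }
}
```

With this change the same bad interleaving (merge, snapshot, remove) no longer resurrects anything: the merge inside remove() recovers the snapshotted records, so the swept IDs are actually dropped. The trade-off versus approach #1 is an extra merge on every removal rather than new suspend/resume methods on the BlobTracker interface.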
This message was sent by Atlassian JIRA