Csaba Varga commented on OAK-7209:
I've created pull requests for my proposed fix. I've taken the second approach
of doing a globalMerge() before removing files from the tracker; this way, the
behavior isn't specific to garbage collection. I've also fixed the unit test
that was supposed to catch this issue, but didn't because it unintentionally
caused a merge before removal. I've also updated the affected version list
since the 1.8 and 1.10 branches are also affected.
Here are the pull requests (I backported to 1.8 and 1.6 because I'd love to
have this fixed in 1.6, the version I'm currently using):
* Trunk: [https://github.com/apache/jackrabbit-oak/pull/81]
* 1.8: [https://github.com/apache/jackrabbit-oak/pull/82]
* 1.6: [https://github.com/apache/jackrabbit-oak/pull/83]
> Race condition can resurrect blobs during blob GC
> Key: OAK-7209
> URL: https://issues.apache.org/jira/browse/OAK-7209
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: blob-plugins
> Affects Versions: 1.6.5, 1.10, 1.8.2
> Reporter: Csaba Varga
> Assignee: Amit Jain
> Priority: Minor
> A race condition exists between the scheduled blob ID publishing process and
> the GC process that can resurrect the blobs being deleted by the GC. This is
> how it can happen:
> # MarkSweepGarbageCollector.collectGarbage() starts running.
> # As part of the preparation for sweeping, BlobIdTracker.globalMerge() is
> called, which merges all blob ID records from the blob store into the local
> tracker file.
> # Sweeping begins deleting files.
> # BlobIdTracker.snapshot() gets called by the scheduler. It pushes all blob
> ID records that were collected and merged in step 2 back into the blob store,
> then deletes the local copies.
> # Sweeping completes and tries to remove the successfully deleted blobs from
> the tracker. Step 4 already deleted those records from the local files, so
> nothing gets removed.
> The end result is that all blobs removed during the GC run will still be
> considered alive, causing warnings when later GC runs try to remove them
> again.
> The risk is higher the longer the sweep runs, but it can happen during a
> short but badly timed GC run as well. (We've found it during a GC run that
> took more than 11 hours to complete.)
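> The interleaving above can be reproduced with a toy single-threaded model of
> the tracker. This is a hedged sketch, not the actual Oak code: the class and
> field names (TrackerModel, localRecords, storeRecords) are hypothetical, and
> the three methods only mimic the merge/snapshot/remove semantics described in
> the steps above.
>
> {code:java}
> import java.util.HashSet;
> import java.util.Set;
>
> /** Hypothetical, simplified model of the tracker's record flow. */
> class TrackerModel {
>     final Set<String> localRecords = new HashSet<>(); // local tracker file
>     final Set<String> storeRecords = new HashSet<>(); // records in the blob store
>
>     void track(String id) { localRecords.add(id); }
>
>     // globalMerge(): pull all records from the store into the local file
>     void globalMerge() { localRecords.addAll(storeRecords); storeRecords.clear(); }
>
>     // snapshot(): push local records to the store, then delete the local copies
>     void snapshot() { storeRecords.addAll(localRecords); localRecords.clear(); }
>
>     // remove(): drops deleted blob IDs, but only from the local file
>     void remove(Set<String> deleted) { localRecords.removeAll(deleted); }
>
>     Set<String> allRecords() {
>         Set<String> all = new HashSet<>(localRecords);
>         all.addAll(storeRecords);
>         return all;
>     }
> }
>
> public class RaceDemo {
>     public static void main(String[] args) {
>         TrackerModel t = new TrackerModel();
>         t.track("blob-1");
>
>         t.globalMerge();               // step 2: GC prepares for the sweep
>         // step 3: sweep deletes blob-1 from the data store (not modeled)
>         t.snapshot();                  // step 4: scheduler fires mid-sweep
>         t.remove(Set.of("blob-1"));    // step 5: removal hits an empty local file
>
>         // blob-1 is still tracked, so later GC runs consider it alive
>         System.out.println(t.allRecords().contains("blob-1")); // prints "true"
>     }
> }
> {code}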
> I can see two ways to approach this:
> # Suspend the execution of BlobIdTracker.snapshot() while Blob GC is in
> progress. This requires adding new methods to the BlobTracker interface to
> allow suspending and resuming snapshotting of the tracker.
> # Have the two overloads of BlobIdTracker.remove() do a globalMerge() before
> trying to remove anything. This ensures that even if a snapshot() call
> happened during the GC run, all IDs are "pulled back" into the local tracker
> and can be removed successfully.
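> In terms of the toy model above, the second approach amounts to having
> remove() call globalMerge() first. Again a hedged sketch with hypothetical
> names, not the real BlobIdTracker implementation:
>
> {code:java}
> import java.util.HashSet;
> import java.util.Set;
>
> /** Same hypothetical model, with the approach-2 fix applied to remove(). */
> class FixedTracker {
>     final Set<String> localRecords = new HashSet<>();
>     final Set<String> storeRecords = new HashSet<>();
>
>     void track(String id) { localRecords.add(id); }
>     void globalMerge() { localRecords.addAll(storeRecords); storeRecords.clear(); }
>     void snapshot() { storeRecords.addAll(localRecords); localRecords.clear(); }
>
>     // The fix: pull everything back into the local file before removing,
>     // so records pushed out by a mid-sweep snapshot() are removed too.
>     void remove(Set<String> deleted) {
>         globalMerge();
>         localRecords.removeAll(deleted);
>     }
> }
>
> public class FixDemo {
>     public static void main(String[] args) {
>         FixedTracker t = new FixedTracker();
>         t.track("blob-1");
>         t.globalMerge();               // GC prepares for the sweep
>         t.snapshot();                  // badly timed scheduler run
>         t.remove(Set.of("blob-1"));    // merges first, so the removal sticks
>
>         boolean stillTracked = t.localRecords.contains("blob-1")
>                 || t.storeRecords.contains("blob-1");
>         System.out.println(stillTracked); // prints "false"
>     }
> }
> {code}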
This message was sent by Atlassian JIRA