Hi oak-dev,

I’ve created a new pull request [0] to review changes I made to get garbage
collection to work for the composite data store.  I’d love some feedback.

Since this entails a change to the MarkSweepGarbageCollector, we should
discuss the change here to see if there are concerns with it or if there is
a better approach.

Let me try to briefly explain the change.

In the use case I tested there are two Oak repositories, one which we will
call primary and one which we will call secondary.  Primary gets created
first; secondary is created by cloning the node store of primary, then
using a CompositeDataStore to have two delegate data stores.  The first
delegate is the same as the data store for primary, in read-only mode.  The
second delegate is only accessible by the secondary repo.

Let the data store shared by primary and secondary be called DS_P and the
data store being used only by the secondary be called DS_S.  DS_P can be
read by secondary but not modified, so all changes on secondary are saved
in DS_S.  Primary can still make changes to DS_P.

Suppose after creating both repositories, records A and B are deleted from
the primary repo, and records B and C are deleted from the secondary repo.
Since DS_P is shared, only blob B should actually be deleted from DS_P via
GC.  After both repositories run their “mark” phase, the primary repo will
have created a “references” file in DS_P excluding A and B, meaning primary
thinks A and B can both be deleted, and the secondary repo will have
created a “references” file in DS_P excluding B and C, meaning secondary
thinks B and C can both be deleted.
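To make that concrete, here is a toy Java model of what each repo
considers deletable after its mark phase (a set-based sketch only; the
method and set names are mine, not Oak’s actual on-disk format):

```java
import java.util.HashSet;
import java.util.Set;

public class MarkSketch {
    // What a repo considers deletable after its mark phase: the blobs it
    // knows about minus the blobs it still references (toy model of the
    // "references" file exclusions).
    static Set<String> deletable(Set<String> knownBlobs, Set<String> stillReferenced) {
        Set<String> result = new HashSet<>(knownBlobs);
        result.removeAll(stillReferenced);
        return result;
    }

    public static void main(String[] args) {
        // Primary knows A and B (in DS_P) and now references neither,
        // so its references file excludes A and B:
        assert deletable(Set.of("A", "B"), Set.of()).equals(Set.of("A", "B"));
        // Secondary knows A, B, C and still references A,
        // so its references file excludes B and C:
        assert deletable(Set.of("A", "B", "C"), Set.of("A")).equals(Set.of("B", "C"));
    }
}
```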

Suppose primary runs the sweep phase first.  It will first verify that
there is a references file for each repository registered in DS_P.  Since
both primary and secondary put one there, this check passes.  It will then merge
all the data in all the references files in DS_P with its own local view of
the existing blobs, and come up with a set of blobs to delete.  Primary
will conclude that blobs B and C should be deleted - B because both primary
and secondary said it is deleted, and C because secondary said it should be
deleted and primary has no knowledge of C so it will assume it is okay to
delete.  At this point primary will delete B and try to delete C and fail
(which is ok).  Then primary will delete its “references” file from DS_P
and call the sweep phase complete.
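As I read it, the sweep-side set arithmetic in this scenario boils down to
something like the following toy model (not the actual
MarkSweepGarbageCollector code; names are mine):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SweepSketch {
    // The sweeper subtracts the union of all uploaded references files
    // from the set of candidate blob ids; whatever remains unreferenced
    // by every repo is slated for deletion (toy model).
    static Set<String> toDelete(Set<String> candidates, List<Set<String>> refFiles) {
        Set<String> result = new HashSet<>(candidates);
        for (Set<String> refs : refFiles) {
            result.removeAll(refs); // anything still referenced is kept
        }
        return result;
    }

    public static void main(String[] args) {
        // Scenario from above: primary deleted A and B, secondary deleted B and C.
        Set<String> primaryRefs = Set.of();       // primary no longer references A or B
        Set<String> secondaryRefs = Set.of("A");  // secondary still references A
        // Candidates primary considers after merging in what secondary reported:
        Set<String> candidates = Set.of("A", "B", "C");

        Set<String> deletions = toDelete(candidates, List.of(primaryRefs, secondaryRefs));
        // Primary deletes B from DS_P; the attempt on C fails because C
        // actually lives in DS_S, which is acceptable.
        assert deletions.equals(Set.of("B", "C"));
    }
}
```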

Now the problem comes when secondary tries to run the sweep phase.  It
will first try to verify that a references file exists for each repository
registered in DS_P - and fail, because primary has already deleted its
references file.  So secondary cancels GC, and blob C never gets
deleted.  Note that secondary must delete C because it is the only
repository that knows about C.
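The precondition that trips secondary up could be sketched like this
(illustrative only, not the actual Oak check):

```java
import java.util.Set;

public class SweepPrecondition {
    // Before sweeping, a repo verifies that every repository sharing the
    // store has uploaded a references file (toy model; file naming is mine):
    static boolean canSweep(Set<String> repositoryIds, Set<String> filesInStore) {
        for (String id : repositoryIds) {
            if (!filesInStore.contains("references-" + id)) {
                return false; // a missing references file cancels GC
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> repos = Set.of("primary", "secondary");
        // Primary already swept and deleted its own references file,
        // so only secondary's remains - secondary's sweep is cancelled:
        assert !canSweep(repos, Set.of("references-secondary"));
    }
}
```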

The same situation also exists if secondary sweeps first.  If record D was
created by primary after secondary was cloned, and D is later deleted by
primary, then secondary never knows about blob D and cannot delete it
during the sweep phase - only primary can delete it.

The change I made to the garbage collector is that when a repository
finishes the sweep phase, it doesn’t necessarily delete the references
file.  Instead it marks the data store with a “sweepComplete” file
indicating that this repository finished the sweep phase.  Once there is a
“sweepComplete” file for every repository (in other words, once the last
repository finishes its sweep), all the references files are deleted.
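In toy form, the proposed end-of-sweep coordination looks roughly like
this (class, method, and file names are illustrative, not the exact ones
from the PR):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative model of the proposed change: instead of deleting its
// references file immediately after sweeping, each repo drops a
// "sweepComplete" marker; the last repo to finish cleans up all the
// references files and markers.
public class SweepCompleteSketch {
    final Set<String> repositories;                       // all repos sharing the store
    final Set<String> sweepComplete = new HashSet<>();    // "sweepComplete" markers
    final Set<String> referencesFiles = new HashSet<>();  // per-repo references files

    SweepCompleteSketch(Set<String> repositories) {
        this.repositories = repositories;
        for (String repo : repositories) {
            referencesFiles.add("references-" + repo);
        }
    }

    void finishSweep(String repoId) {
        sweepComplete.add(repoId);                 // mark this repo's sweep as done
        if (sweepComplete.containsAll(repositories)) {
            referencesFiles.clear();               // last sweeper deletes all references files
            sweepComplete.clear();                 // and the markers, ready for the next GC run
        }
    }
}
```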

I wrote an integration test to test DSGC for this specific composite data
store use case at [1].

All the Oak unit tests pass with this change.  I am concerned about
unforeseen consequences, so I’d like to hear any concerns others on-list
may have about this change.  Also, sweeping must now be done by every
repository sharing the data store, which introduces some inefficiency.
I’m open to changes or to a different approach, as long as it still solves
the problem described above.

0 - https://github.com/apache/jackrabbit-oak/pull/80
1 -
