On Wed, Nov 7, 2012 at 10:52 AM, Michael Dürig <[email protected]> wrote:
>
>
> On 7.11.12 9:48, Thomas Mueller wrote:
>>
>> Hi,
>>
>> Didn't we talk once about defining a format for blob id references, so
>> that a value of the format "bin:{blobId}" (or similar) is reference?
>
>
> This is exactly the problem I wanted to pinpoint. There is a conceptual leak
> here: in order for the Microkernel implementation to know that something is
> a reference to a binary, it has to know about the interpretation of the
> items in the repository by the upper layers.

the format of references to binaries is documented in the MicroKernel
java doc, see "Retention Policy for Binaries" [0].

cheers
stefan

[0] http://svn.apache.org/repos/asf/jackrabbit/oak/trunk/oak-mk-api/src/main/java/org/apache/jackrabbit/mk/api/MicroKernel.java

>
> Michael
>
>
>>
>> Regards,
>> Thomas
>>
>>
>>
>> On 11/7/12 10:17 AM, "Michael Dürig" <[email protected]> wrote:
>>
>>>
>>> On a related note: how does the garbage collector even find out whether
>>> a binary is "referenced"? That is, on the Microkernel level, what does
>>> it actually mean for a binary to be referenced?
>>>
>>> Michael
>>>
>>> On 6.11.12 18:45, Michael Marth wrote:
>>>>
>>>> this might be a weird question from the leftfield, but are we actually
>>>> sure that the existing data store concept is worth the trouble? afaiu it
>>>> saves us from storing the same binary twice, but leads into the DSGC
>>>> topic. would it be possible to make it optional to store/address
>>>> binaries by hash (and thus not need DSGC for these configurations)?
>>>>
>>>> In any case we should definitely avoid to require repo traversal for
>>>> DSGC. This would operationally limit the repo sizes Oak can support.
>>>>
>>>>
>>>> --
>>>> Michael Marth | Engineering Manager
>>>> +41 61 226 55 22 | [email protected]<mailto:[email protected]>
>>>> Barfüsserplatz 6, CH-4001 Basel, Switzerland
>>>>
>>>> On Nov 6, 2012, at 9:24 AM, Thomas Mueller wrote:
>>>>
>>>> Hi,
>>>>
>>>> 1- What's considered an "old" node or commit? Technically, anything
>>>> other than the head revision is old but can we remove them right away
>>>> or do we need to retain a number of revisions? If the latter, then how
>>>> far back do we need to retain?
>>>>
>>>> we discussed this a while back, no good solution back then[1]
>>>>
>>>> Yes. Somebody has to decide which revisions are no longer needed.
>>>> Luckily it doesn't need to be us :-) We might set a default value
>>>> (10 minutes or so), and then give the user the ability to change that,
>>>> depending on whether he cares more about disk space or the ability to
>>>> read old data / roll back to an old state.
>>>>
>>>> To free up disk space, BlobStore garbage collection is actually more
>>>> important, because usually 90% of the disk space is used by the
>>>> BlobStore. So it would be nice if items (files) in the BlobStore are
>>>> deleted as soon as possible after deleting old revisions. In
>>>> Jackrabbit 2.x we have seen that node and data store garbage
>>>> collection that has to traverse the whole repository is problematic
>>>> if the repository is large. So garbage collection can be a
>>>> scalability issue: if we have to traverse all revisions of all nodes
>>>> in order to delete unused data, we basically tie garbage collection
>>>> speed with repository size, unless if we find a way to run it in
>>>> parallel. But running mark & sweep garbage collection completely in
>>>> parallel is not easy (is it even possible? if yes I would have
>>>> guessed modern JVMs should have it since a long time). So I think if
>>>> we don't need to traverse the repository to delete old nodes, but
>>>> just traverse the journal, this would be much less of a problem.
>>>>
>>>> Regards,
>>>> Thomas
>>>>
>>>>
>>>>
>>
>
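
For illustration only, here is a rough Python sketch of the journal-driven
idea discussed above: replay the journal instead of traversing the
repository, and treat a blob as live if it is referenced in the oldest
retained revision or was added in any retained one. Everything in it is an
assumption rather than Oak code: the journal is a made-up list of
(timestamp, added_values, removed_values) tuples, the blob store is a plain
dict, and the "bin:{blobId}" prefix is just the format floated in this
thread, not necessarily what the MicroKernel javadoc specifies. The 600
second default mirrors the "10 minutes or so" suggestion.

```python
from collections import Counter

# Hypothetical reference prefix, taken from the "bin:{blobId}" suggestion in
# this thread; the real format is whatever the MicroKernel javadoc defines.
BLOB_PREFIX = "bin:"


def blob_ids(values):
    """Return the blob ids referenced by a list of property values."""
    return [
        value[len(BLOB_PREFIX):]
        for value in values
        if isinstance(value, str) and value.startswith(BLOB_PREFIX)
    ]


def live_blob_ids(journal, cutoff):
    """Replay the journal and return the blob ids that must be kept.

    journal: list of (timestamp, added_values, removed_values), oldest first.
    Removals are only applied for entries older than the cutoff, because a
    retained revision can still read a reference that a later one removed.
    """
    counts = Counter()
    for timestamp, added, removed in journal:
        counts.update(blob_ids(added))
        if timestamp < cutoff:
            counts.subtract(blob_ids(removed))
    return {blob_id for blob_id, count in counts.items() if count > 0}


def collect_garbage(journal, blob_store, now, retention_seconds=600):
    """Delete blobs that no retained revision references; return their ids."""
    live = live_blob_ids(journal, now - retention_seconds)
    garbage = set(blob_store) - live
    for blob_id in garbage:
        del blob_store[blob_id]
    return garbage
```

With a journal that adds blobs a and b at t=0, swaps a for c at t=100 and
removes b at t=1000, a run at t=1200 deletes only a: b was removed inside
the retention window, so an old revision can still read it. A real
implementation would presumably persist the counts rather than replay the
full journal on every run.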
