Hi,

On 6/14/07, Thomas Mueller <[EMAIL PROTECTED]> wrote:
> I think it is. But it is not very important to decide what garbage
> collection algorithm to use at this stage. It is still possible to
> switch the algorithm later on. OK, it is a bit more work.
ACK, we can get back to that when we have some concrete code to work with.
> > The garbage collection process can be run in the background
> > (it doesn't block normal access) so performance isn't essential
>
> It can't run while others are writing to the repository, and I think
> that's a problem. Example: Let's say the garbage collection algorithm
> scans the repository from left to right, and 'A' is a link to 'File A'.
> Now at the beginning of the scan, the repository looks like this:
>
> S----------------------------------------------------------A---
>
> after some time:
>
> ----------S------------------------------------------------A---
>
> now somebody moves the node that contains A:
>
> ---A------------------S-----------------------------------------
>
> the scan finishes and didn't find a reference to A:
>
> ---A-----------------------------------------------------------S
I think we can cover that fairly easily by adding a hook in PersistenceManager.store(), or in the code that calls it, that "marks" all binary identifiers within the changeset.
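For illustration, a minimal sketch of what such a hook could track (all class and method names here are hypothetical, not existing Jackrabbit API):

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: a mark registry that the store() code path
    // updates for every binary identifier in a changeset, so a scan
    // that is already past the node's old position still sees the
    // reference even if the node was moved during the scan.
    public class BinaryMarkRegistry {

        // Identifiers marked since the current scan started.
        private final Set<String> marked = ConcurrentHashMap.newKeySet();

        // Hook called from (or next to) PersistenceManager.store() for
        // each binary identifier found in the changeset being written.
        public void mark(String binaryIdentifier) {
            marked.add(binaryIdentifier);
        }

        // The sweep phase must skip anything marked during the scan.
        public boolean canDelete(String binaryIdentifier) {
            return !marked.contains(binaryIdentifier);
        }

        // Called whenever a new scan starts.
        public void startNewScan() {
            marked.clear();
        }
    }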
> > given the amount of space that the approach saves in typical setups
> > I'm not too worried about reclaiming unused space later than
> > necessary.
>
> That depends on the setup. If you use the repository to manage movies
> (no versioning), then I would be worried.
I guess if you are working with movies then you'd typically have enough disk space to, for example, keep a day's worth of draft copies (assuming the garbage collector would run daily). And if you are doing a massive (>> 10GB) cleanup to release disk space, then I guess you could also explicitly invoke the garbage collector (or the repository might even be intelligent enough to automatically start the process if it sees a large drop in node count). I think PostgreSQL uses a similar late-release vacuum mechanism with good results: it can be invoked explicitly with the VACUUM statement, and there is also an auto-vacuum background thread.
> > The main problem I have with reference counting in this case is
> > that it would bind the data store into transaction handling
>
> Yes, a little bit.
>
> > and all related issues.
>
> Could you clarify? I would increment the counts early (before
> committing) and decrement the counts late (after the commit), then
> the worst case is, after a crash, to have a counter that is too high
> (seldom).
Which means that we would still need to have a garbage collection process. But yes, a solution that doesn't bind the data store to the actual JCR transactions would be much preferred.
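To make the crash semantics concrete, here is a rough sketch of the increment-early / decrement-late pattern described above (names are hypothetical):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch: counts are bumped before the commit and released only
    // after it, so a crash in between can at worst leave a count that
    // is too high. That keeps a record alive longer than needed (until
    // a later garbage collection run), but never deletes one that is
    // still in use.
    public class ReferenceCounts {

        private final Map<String, AtomicInteger> counts =
                new ConcurrentHashMap<>();

        // Called before the transaction commits.
        public void incrementEarly(String identifier) {
            counts.computeIfAbsent(identifier, k -> new AtomicInteger())
                  .incrementAndGet();
        }

        // Called only after the transaction has committed. Returns
        // true if the record is now unreferenced and may be deleted.
        public boolean decrementLate(String identifier) {
            AtomicInteger count = counts.get(identifier);
            if (count != null && count.decrementAndGet() <= 0) {
                counts.remove(identifier);
                return true;
            }
            return false;
        }
    }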
> Actually, what about back references: each large object knows who (it
> thinks) is pointing to it. Mark and sweep would then be trivial. The
> additional space used would be minimal (compared to a large object).
That might work also. I'm open to trying it out, but I think we need to make a judgement call at some point on whether the benefits of early release of unused space outweigh the added complexity of reference tracking.
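As a thought experiment, the back-reference idea could look roughly like this (hypothetical names; the ReferenceChecker callback into the persistence layer is invented for the sketch):

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: each large object records the ids of the nodes that (it
    // thinks) reference it. Stale entries are safe; they just delay
    // deletion until the sweep verifies them and finds them gone.
    public class BackReferencedRecord {

        private final String identifier;
        private final Set<String> referrers = ConcurrentHashMap.newKeySet();

        public BackReferencedRecord(String identifier) {
            this.identifier = identifier;
        }

        // Called whenever a node starts referencing this record.
        public void addReferrer(String nodeId) {
            referrers.add(nodeId);
        }

        // Sweep: verify only the recorded candidates instead of
        // scanning the whole repository; the record is garbage once
        // none of them still holds a reference.
        public boolean isGarbage(ReferenceChecker checker) {
            referrers.removeIf(nodeId -> !checker.references(nodeId, identifier));
            return referrers.isEmpty();
        }

        // Hypothetical callback into the persistence layer.
        public interface ReferenceChecker {
            boolean references(String nodeId, String recordIdentifier);
        }
    }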
> > It would also introduce locking inside the data store to avoid
> > problems with concurrent reference changes.
>
> Manipulating references to large objects is not that common I think:
> moving nodes (maybe) and versioning. I would use simple 'synchronized'
> blocks.
Fair enough. My main concern here is the qualitative jump from "no synchronization" to "some synchronization" and all the added complexity it brings. But I guess it's a decision that is better made when we have some concrete code and test results that show the benefits of the various solutions.
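For concreteness, the kind of simple synchronized guarding suggested above might look like this (a sketch with hypothetical names):

    // Sketch: reference changes are rare (moves, versioning), so a
    // plain synchronized block per record is probably enough; no
    // elaborate lock-free scheme is needed for correctness here.
    public class GuardedRefCount {

        private final Object lock = new Object();
        private int count;

        public void addReference() {
            synchronized (lock) {
                count++;
            }
        }

        // Returns true if the last reference was just removed.
        public boolean removeReference() {
            synchronized (lock) {
                return --count == 0;
            }
        }
    }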
> > > why not store large Strings in the global data store
> >
> > I was thinking of perhaps adding isString() and getString() methods
> > to DataRecord for checking whether a given binary stream is valid
> > UTF-8 and for retrieving the encoded string value in case it is.
>
> I probably lost you here. The application decides if it wants to use
> PropertyType.STRING or PropertyType.BINARY. No need to guess the type
> from the byte array. I was thinking about storing large instances of
> PropertyType.STRING (java.lang.String) as a file.
Again my point is simplicity. Instead of adding separate record types (DataRecord vs. StringRecord) we could use the single binary record type for both binaries and strings. Of course we could leave the getString() mechanism up to the data store client (it would consume the InputStream to construct the String), but having those methods in the DataRecord interface allows nice optimizations of certain corner cases, like not having to read a large string into memory when doing getProperty(...).getStream().
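A sketch of what that could look like at the interface level (hypothetical; the real method signatures would be settled later):

    import java.io.IOException;
    import java.io.InputStream;

    // Sketch: a single binary record type that serves both binaries
    // and strings. The optional string accessors let an implementation
    // avoid reading a large string into memory when the client only
    // wants the stream, e.g. for getProperty(...).getStream().
    public interface DataRecord {

        String getIdentifier();

        InputStream getStream() throws IOException;

        // True if the underlying bytes are valid UTF-8 and the record
        // can therefore be exposed as a string value.
        boolean isString() throws IOException;

        // Decodes the record as a UTF-8 string; only valid if
        // isString() returns true.
        String getString() throws IOException;
    }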
> > Together with the above inline mechanism we should in fact be able
> > to make no distinction between binary and string values in the
> > persistence layer.
>
> Yes. You could add a property 'isLarge' to InternalValue, or you could
> extend InternalValue. Actually, I think InternalValue is quite memory
> intensive: it uses two objects for each INTEGER. I suggest using an
> interface, with InternalValueInt, InternalValueString, InternalValueLong
> and so on. And/or use a cache for the most commonly used objects
> (integers 0-1000, the empty String, boolean true/false).
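As a rough illustration of that suggestion (hypothetical names, not the actual InternalValue code), a per-type implementation plus a small cache could look like this:

    import javax.jcr.PropertyType;

    // Sketch: one lightweight implementation per value type, plus a
    // cache of the most commonly used values so that frequent small
    // integers do not allocate a fresh object on every use.
    public interface TypedValue {

        int getType();

        final class LongValue implements TypedValue {

            // Cached instances for the common range 0-1000.
            private static final LongValue[] CACHE = new LongValue[1001];
            static {
                for (int i = 0; i < CACHE.length; i++) {
                    CACHE[i] = new LongValue(i);
                }
            }

            private final long value;

            private LongValue(long value) {
                this.value = value;
            }

            // Returns a cached instance where possible.
            public static LongValue valueOf(long v) {
                return (v >= 0 && v < CACHE.length)
                        ? CACHE[(int) v] : new LongValue(v);
            }

            public long getLong() {
                return value;
            }

            public int getType() {
                return PropertyType.LONG;
            }
        }
    }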
+1
But that's another discussion.
Let's follow up on that. :-)

BR,

Jukka Zitting
