Hi,

On 6/14/07, Thomas Mueller <[EMAIL PROTECTED]> wrote:
> I think it is. But it is not very important to decide what garbage
> collection algorithm to use at this stage. It is still possible to
> switch the algorithm later on. OK, it is a bit more work.
ACK, we can get back to that when we have some concrete code to work with.
> > The garbage collection process can be run in the background
> > (it doesn't block normal access) so performance isn't essential
>
> It can't run while others are writing to the repository, and I think
> that's a problem. Example: Let's say the garbage collection algorithm
> scans the repository from left to right, and 'A' is a link to 'File A'.
> Now at the beginning of the scan, the repository looks like this:
>
> S----------------------------------------------------------A---
>
> after some time:
>
> ----------S------------------------------------------------A---
>
> now somebody moves the node that contains A:
>
> ---A------------------S-----------------------------------------
>
> the scan finishes and didn't find a reference to A:
>
> ---A-----------------------------------------------------------S
I think we can cover that fairly easily by adding a hook in PersistenceManager.store(), or in the code that calls it, that "marks" all binary identifiers within the changeset.
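For illustration, a minimal sketch of what such a hook could track (all class and method names here are hypothetical, not existing Jackrabbit API):

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: a mark registry that the store() code path
    // updates for every binary identifier in a changeset, so a scan
    // that is already past the node's old position still sees the
    // reference even if the node was moved during the scan.
    public class BinaryMarkRegistry {

        // Identifiers marked since the current scan started.
        private final Set<String> marked = ConcurrentHashMap.newKeySet();

        // Hook called from (or next to) PersistenceManager.store() for
        // each binary identifier found in the changeset being written.
        public void mark(String binaryIdentifier) {
            marked.add(binaryIdentifier);
        }

        // The sweep phase must skip anything marked during the scan.
        public boolean canDelete(String binaryIdentifier) {
            return !marked.contains(binaryIdentifier);
        }

        // Called whenever a new scan starts.
        public void startNewScan() {
            marked.clear();
        }
    }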
> > given the amount of space that the approach saves in typical setups
> > I'm not too worried about reclaiming unused space later than
> > necessary.
>
> That depends on the setup. If you use the repository to manage movies
> (no versioning), then I would be worried.
I guess if you are working with movies then you'd typically have enough disk space to, for example, keep a day's worth of draft copies (assuming the garbage collector would run daily). And if you are doing a massive (>> 10GB) cleanup to release disk space, then I guess you could also explicitly invoke the garbage collector (or the repository might even be intelligent enough to automatically start the process if it sees a large drop in node count). I think PostgreSQL uses a similar late-release vacuum mechanism with good results: it can be invoked explicitly with the VACUUM statement, and there is also an auto-vacuum background thread.
> > The main problem I have with reference counting in this case is
> > that it would bind the data store into transaction handling
>
> Yes, a little bit.
>
> > and all related issues.
>
> Could you clarify? I would increment the counts early (before
> committing) and decrement the counts late (after the commit), then
> the worst case is, after a crash, to have a counter that is too high
> (seldom).
Which means that we would still need to have a garbage collection process. But yes, a solution that doesn't bind the data store to the actual JCR transactions would be much preferred.
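To make the crash semantics concrete, here is a rough sketch of the increment-early / decrement-late pattern described above (names are hypothetical):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch: counts are bumped before the commit and released only
    // after it, so a crash in between can at worst leave a count that
    // is too high. That keeps a record alive longer than needed (until
    // a later garbage collection run), but never deletes one that is
    // still in use.
    public class ReferenceCounts {

        private final Map<String, AtomicInteger> counts =
                new ConcurrentHashMap<>();

        // Called before the transaction commits.
        public void incrementEarly(String identifier) {
            counts.computeIfAbsent(identifier, k -> new AtomicInteger())
                  .incrementAndGet();
        }

        // Called only after the transaction has committed. Returns
        // true if the record is now unreferenced and may be deleted.
        public boolean decrementLate(String identifier) {
            AtomicInteger count = counts.get(identifier);
            if (count != null && count.decrementAndGet() <= 0) {
                counts.remove(identifier);
                return true;
            }
            return false;
        }
    }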
> Actually, what about back references: each large object knows who (it
> thinks) is pointing to it. Mark and sweep would then be trivial. The
> additional space used would be minimal (compared to a large object).
That might work also. I'm open to trying it out, but I think we need to make a judgement call at some point on whether the benefits of early release of unused space outweigh the added complexity of reference tracking.
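As a thought experiment, the back-reference idea could look roughly like this (hypothetical names; the ReferenceChecker callback into the persistence layer is invented for the sketch):

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: each large object records the ids of the nodes that (it
    // thinks) reference it. Stale entries are safe; they just delay
    // deletion until the sweep verifies them and finds them gone.
    public class BackReferencedRecord {

        private final String identifier;
        private final Set<String> referrers = ConcurrentHashMap.newKeySet();

        public BackReferencedRecord(String identifier) {
            this.identifier = identifier;
        }

        // Called whenever a node starts referencing this record.
        public void addReferrer(String nodeId) {
            referrers.add(nodeId);
        }

        // Sweep: verify only the recorded candidates instead of
        // scanning the whole repository; the record is garbage once
        // none of them still holds a reference.
        public boolean isGarbage(ReferenceChecker checker) {
            referrers.removeIf(nodeId -> !checker.references(nodeId, identifier));
            return referrers.isEmpty();
        }

        // Hypothetical callback into the persistence layer.
        public interface ReferenceChecker {
            boolean references(String nodeId, String recordIdentifier);
        }
    }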
> > It would also introduce locking inside the data store to avoid
> > problems with concurrent reference changes.
>
> Manipulating references to large objects is not that common I think:
> moving nodes (maybe) and versioning. I would use simple 'synchronized'
> blocks.
Fair enough. My main concern here is the qualitative jump from "no synchronization" to "some synchronization" and all the added complexity it brings. But I guess it's a decision that is better made when we have some concrete code and test results that show the benefits of the various solutions.
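For concreteness, the kind of simple synchronized guarding suggested above might look like this (a sketch with hypothetical names):

    // Sketch: reference changes are rare (moves, versioning), so a
    // plain synchronized block per record is probably enough; no
    // elaborate lock-free scheme is needed for correctness here.
    public class GuardedRefCount {

        private final Object lock = new Object();
        private int count;

        public void addReference() {
            synchronized (lock) {
                count++;
            }
        }

        // Returns true if the last reference was just removed.
        public boolean removeReference() {
            synchronized (lock) {
                return --count == 0;
            }
        }
    }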
> > > why not store large Strings in the global data store
> >
> > I was thinking of perhaps adding isString() and getString() methods
> > to DataRecord for checking whether a given binary stream is valid
> > UTF-8 and for retrieving the encoded string value in case it is.
>
> I probably lost you here. The application decides if it wants to use
> PropertyType.STRING or PropertyType.BINARY. No need to guess the type
> from the byte array. I was thinking about storing large instances of
> PropertyType.STRING (java.lang.String) as a file.
Again my point is simplicity. Instead of adding separate record types (DataRecord vs. StringRecord) we could use the single binary record type for both binaries and strings. Of course we could leave the getString() mechanism up to the data store client (it would consume the InputStream to construct the String), but having those methods in the DataRecord interface allows nice optimizations of certain corner cases, like not having to read a large string into memory when doing getProperty(...).getStream().
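A sketch of what that could look like at the interface level (hypothetical; the real method signatures would be settled later):

    import java.io.IOException;
    import java.io.InputStream;

    // Sketch: a single binary record type that serves both binaries
    // and strings. The optional string accessors let an implementation
    // avoid reading a large string into memory when the client only
    // wants the stream, e.g. for getProperty(...).getStream().
    public interface DataRecord {

        String getIdentifier();

        InputStream getStream() throws IOException;

        // True if the underlying bytes are valid UTF-8 and the record
        // can therefore be exposed as a string value.
        boolean isString() throws IOException;

        // Decodes the record as a UTF-8 string; only valid if
        // isString() returns true.
        String getString() throws IOException;
    }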
> > Together with the above inline mechanism we should in fact be able
> > to make no distinction between binary and string values in the
> > persistence layer.
>
> Yes. You could add a property 'isLarge' to InternalValue, or you could
> extend InternalValue. Actually, I think InternalValue is quite memory
> intensive: it uses two objects for each INTEGER. I suggest using an
> interface, with InternalValueInt, InternalValueString, InternalValueLong
> and so on. And/or use a cache for the most commonly used objects
> (integers 0-1000, the empty String, boolean true/false).
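As a rough illustration of that suggestion (hypothetical names, not the actual InternalValue code), a per-type implementation plus a small cache could look like this:

    import javax.jcr.PropertyType;

    // Sketch: one lightweight implementation per value type, plus a
    // cache of the most commonly used values so that frequent small
    // integers do not allocate a fresh object on every use.
    public interface TypedValue {

        int getType();

        final class LongValue implements TypedValue {

            // Cached instances for the common range 0-1000.
            private static final LongValue[] CACHE = new LongValue[1001];
            static {
                for (int i = 0; i < CACHE.length; i++) {
                    CACHE[i] = new LongValue(i);
                }
            }

            private final long value;

            private LongValue(long value) {
                this.value = value;
            }

            // Returns a cached instance where possible.
            public static LongValue valueOf(long v) {
                return (v >= 0 && v < CACHE.length)
                        ? CACHE[(int) v] : new LongValue(v);
            }

            public long getLong() {
                return value;
            }

            public int getType() {
                return PropertyType.LONG;
            }
        }
    }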
+1
But that's another discussion.
Let's follow up on that. :-)

BR,

Jukka Zitting
