Hi,

On 4/24/07, David Nuescheler <[EMAIL PROTECTED]> wrote:
> > A value record would essentially be an array of bytes as defined in
> > Value.getStream(). In other words the integer value 123 and the string
> > value "123" would both be stored in the same value record. More
> > specific typing information would be indicated in the property record
> > that refers to that value. For example an integer property and a
> > string property could both point to the same value record, but have
> > different property types that indicate the default interpretation of
> > the value.
> i think that with small values we have to keep in mind that the
> "key" (value identifier) may be bigger than the actual value, and of
> course the additional indirection also has a performance impact.
> do you think we should consider a minimum size for values to be
> stored in this manner? personally, i think that this might make
> sense.

For consistency I would use such value records for all values,
regardless of the value size. I'd like to keep the value identifiers
as short as possible, optimally just 64 bits, to avoid too much
storage and bandwidth overhead. The indirection costs could probably
best be avoided by storing copies of short value contents along with
the value identifiers where the values are referenced.
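As a minimal Java sketch of that idea (the class, the field names, and the 64-byte inline threshold are my illustrative assumptions, not part of the proposal):

```java
// Hypothetical sketch: a value reference that always carries the 64-bit
// value record identifier, plus an inlined copy of the bytes for short
// values so that reads can skip the extra indirection.
public class ValueRef {

    // Assumed cut-off for inlining; the actual limit would be tuned.
    static final int INLINE_LIMIT = 64;

    final long id;            // 64-bit value record identifier
    final byte[] inlineCopy;  // non-null only for short values

    ValueRef(long id, byte[] bytes) {
        this.id = id;
        this.inlineCopy = bytes.length <= INLINE_LIMIT ? bytes.clone() : null;
    }

    boolean isInlined() {
        return inlineCopy != null;
    }
}
```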

> anyway, what key did you have in mind?
> i would assume some sort of a hash (md5) could be great or is this
> still more abstract?

I was thinking about something more concrete, like a direct disk
offset. The value identifier could for example be a 64 bit integer
with the first 32 bits identifying the revision that contains the
value and the last 32 bits being the offset of the value record within
a "value file". I haven't yet calculated whether such a scheme gives
us a large enough identifier space.
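To make the packing concrete, here's a small Java sketch of such a 64-bit identifier (the class and method names are just for illustration):

```java
// Hypothetical sketch: a 64-bit value identifier whose high 32 bits
// name the revision containing the value and whose low 32 bits are the
// offset of the value record within that revision's "value file".
public class ValueId {

    public static long pack(int revision, int offset) {
        // Mask the offset to avoid sign extension when widening to long.
        return ((long) revision << 32) | (offset & 0xFFFFFFFFL);
    }

    public static int revision(long id) {
        return (int) (id >>> 32);
    }

    public static int offset(long id) {
        return (int) id;
    }
}
```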

> > Name and path values are stored as strings using namespace prefixes
> > from an internal namespace registry. Stability of such values is
> > enforced by restricting this internal namespace registry to never
> > remove or modify existing prefix mappings, only new namespace mappings
> > can be added.
> sounds good, i assume that the "internal" namespace registry gets
> its initial prefix mappings from the "public" namespace registry?
> i think having the same prefixes could be beneficial since remappings
> and removals are very rare even in the public registry and this would
> allow us to optimize the more typical case even better.

Exactly. In most cases, like when using the standard JCR prefix
mappings, the stored name and path values can be passed as-is through
the JCR API.
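A rough Java sketch of the append-only behaviour such an internal registry would need (class and method names are assumptions, not Jackrabbit API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: an internal namespace registry where an existing
// prefix -> URI mapping can never be removed or remapped, so stored
// name and path strings stay valid forever.
public class InternalNamespaceRegistry {

    private final Map<String, String> prefixToUri = new HashMap<>();

    // Registering the same mapping again is a no-op; remapping an
    // existing prefix to a different URI is rejected.
    public synchronized void register(String prefix, String uri) {
        String existing = prefixToUri.get(prefix);
        if (existing != null && !existing.equals(uri)) {
            throw new IllegalStateException("prefix already mapped: " + prefix);
        }
        prefixToUri.put(prefix, uri);
    }
}
```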

> > Achieving uniqueness of the value records requires a way to determine
> > whether an instance of a given value already exists. Some indexing is
> > needed to avoid having to traverse the entire set of existing value
> > records for each new value being created.
> i agree and i think we have to make sure that the overhead
> of calculating the key (value identifier) is reasonable, so
> "insert performance" doesn't suffer too much.

Note that the "value key" can well be different from the value
identifier. I was thinking of using something like a standard (and
fast) CRC code as the hash key for looking up potential matches. For
large binaries we could also calculate a SHA checksum to avoid having
to read through the entire byte stream when checking for equality. For
short values the CRC coupled with an exact byte comparison should be
good enough.
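A minimal Java sketch of that two-level scheme (the SHA-256 choice and the helper names are my assumptions; the mail only says "a SHA checksum"):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

// Hypothetical sketch: a fast CRC32 as the index key for finding
// candidate value records, plus a stronger digest for large binaries so
// equality checks don't have to re-read the whole byte stream.
public class ValueHash {

    // Cheap hash key used to look up potential matches in the index;
    // collisions are resolved by an exact byte comparison.
    public static long crc(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    // Stronger digest computed once per large binary; matching digests
    // are taken as equality without a byte-by-byte scan.
    public static byte[] sha(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```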

> i could even see an asynchronous model that "inlines" values
> of all sizes initially and then leaves it up to some sort of garbage
> collection job to "extract" the large values and store them as
> immutable value records...
> this could preserve "insert performance" and allow us to benefit from
> efficient operations for things like copy, clone, etc., and of course
> from the reduced space consumption.

I would be ready to trade some insert performance for more
consistency, but let's see how much the cost would be in practice.

BR,

Jukka Zitting
