RE: [RT]: Calculating the cache key

Berin Loritsch Tue, 21 May 2002 05:38:35 -0700

> From: Carsten Ziegeler [mailto:[EMAIL PROTECTED]] 
> 
> Hi,
> 
> with the new source resolving (from excalibur) we can 
> implement a better caching mechanism (I mentioned this 
> several times in the last weeks).
> 
> I'm just wondering, how the cache key should be generated.
> 
> The current solution (Interface Cacheable) uses a long which 
> is in 99,9% a hash value of the URI (with some additional 
> information sometimes). This key is unique within the space 
> of the component 
> (e.g. file generator, xslt transformer etc).
> So the current implementation is nearly always the same:
> - Creating a string buffer
> - Appending all information
> - Hashing the string buffer


<snip/>

I'm wondering about the whole hash algorithm.  Concatenating
strings can get expensive, especially if it is done in multiple
sections mutliple times.  The fact is that hash values as a long
provide a virtually limitless number of possibilities for cache
values in a site.  With 2^32 (or over 4 billion unique entries)
you are not likely to ever need to repeat.

Here is the algorythm for a string hash:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]


Your possible hash values are limited by the values that can
exist in a string.  The shorter your string the fewer possible
values.  The fact that they use 31 as a multiplier is only so
that they can spread the hash values better (multiples of 2
are even more limiting).

Using the formula, you can translate this:

"http://localhost:8080/cocoon/hello.html";

into this:

'h'*31^(39-1) + 't'*31^(39-2) .... + 'l'

I'll let you do the math yourself.

It also means that computing the hash key takes a proportionate
amount of time based on the length of the string.  What makes it
worse is the fact that the unique part of the string is only after
the "http://localhost:8080/cocoon/"; section.  We are doing more
work for less gain.

All a hashCode is supposed to buy us is a quick way to get to the
information we need.  A way to get us to the "bucket" if you will,
which if there are collisions we need to iterate through the entries
in the bucket until we find our match.  This takes even longer when
we are dealing with strings.

What we need is a way of either generating a match based on a GUID
(Globally Unique IDentifier), or create a compound key that has
a group of id's generated from each component.  The compound key
approach has a drawback in that in order to match a key, you have to
iterate through all the parts of the key.  However you can easily
generate a hash value from the parts, and there are fewer parts.

Consider a compound key that has an int for each part.  That integer
is made up of the top half marking the component ID (there will be less
than 65,536 unique types of components), and the unique entry within
that component (another 65,536 entries).  We can shift the balance
as necessary, but keep up.  A compound key has one part for each
component that is part of the pipeline.  Something like this:

p[0]=fe01b342 { FileGenerator, resource="hello.xml" }
p[1]=fb021431 { XSLTTransformer, resource="layout.xsl" }
p[2]=fb02542e { XSLTTransformer, resource="skin.xsl" }
p[3]=fc010000 { HtmlSerializer }

This is a compound key with only four entries (much quicker to iterate
through the entries than a really long string).  The Hash algorithm is
an exercise I can let people far more intelligent than I to come up
with.
Basically it should be driven by the four entries.

Perhaps something along these lines:

s[0]%n*2^(32/n) + s[1]%n*s^(32/n) .. s[n-1]%n*s^(32/n)

I'm no genious, but I think it is better than working with strings if
we really don't have to.



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, email: [EMAIL PROTECTED]

RE: [RT]: Calculating the cache key

Reply via email to