RE: [RT]: Calculating the cache key

Carsten Ziegeler Tue, 21 May 2002 05:55:28 -0700


Berin Loritsch wrote:
> 
> <snip/>
> 
> I'm wondering about the whole hash algorithm.  Concatenating
> strings can get expensive, especially if it is done in multiple
> sections mutliple times.  The fact is that hash values as a long
> provide a virtually limitless number of possibilities for cache
> values in a site.  With 2^32 (or over 4 billion unique entries)
> you are not likely to ever need to repeat.
> 
> Here is the algorythm for a string hash:
> 
> s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
> 
> 
> Your possible hash values are limited by the values that can
> exist in a string.  The shorter your string the fewer possible
> values.  The fact that they use 31 as a multiplier is only so
> that they can spread the hash values better (multiples of 2
> are even more limiting).
> 
> Using the formula, you can translate this:
> 
> "http://localhost:8080/cocoon/hello.html";
> 
> into this:
> 
> 'h'*31^(39-1) + 't'*31^(39-2) .... + 'l'
> 
> I'll let you do the math yourself.


Yes, but what if my strings are, let's say 1000 characters long?
And looking at the code for example in the xslt transformer or
some other places this is a realistic value.

> 
> It also means that computing the hash key takes a proportionate
> amount of time based on the length of the string.  What makes it
> worse is the fact that the unique part of the string is only after
> the "http://localhost:8080/cocoon/"; section.  We are doing more
> work for less gain.
> 
> All a hashCode is supposed to buy us is a quick way to get to the
> information we need.  A way to get us to the "bucket" if you will,
> which if there are collisions we need to iterate through the entries
> in the bucket until we find our match.  This takes even longer when
> we are dealing with strings.
> 
> What we need is a way of either generating a match based on a GUID
> (Globally Unique IDentifier), or create a compound key that has
> a group of id's generated from each component.  The compound key
> approach has a drawback in that in order to match a key, you have to
> iterate through all the parts of the key.  However you can easily
> generate a hash value from the parts, and there are fewer parts.
> 
> Consider a compound key that has an int for each part.  That integer
> is made up of the top half marking the component ID (there will be less
> than 65,536 unique types of components), and the unique entry within
> that component (another 65,536 entries).  We can shift the balance
> as necessary, but keep up.  A compound key has one part for each
> component that is part of the pipeline.  Something like this:
> 
> p[0]=fe01b342 { FileGenerator, resource="hello.xml" }
> p[1]=fb021431 { XSLTTransformer, resource="layout.xsl" }
> p[2]=fb02542e { XSLTTransformer, resource="skin.xsl" }
> p[3]=fc010000 { HtmlSerializer }
> 
> This is a compound key with only four entries (much quicker to iterate
> through the entries than a really long string).  The Hash algorithm is
> an exercise I can let people far more intelligent than I to come up
> with.
> Basically it should be driven by the four entries.
> 
> Perhaps something along these lines:
> 
> s[0]%n*2^(32/n) + s[1]%n*s^(32/n) .. s[n-1]%n*s^(32/n)
> 
> I'm no genious, but I think it is better than working with strings if
> we really don't have to.
> 
The current caching algorithm uses this compound key with the exception
that the longs are used to build a string:

"FileGenerator:fe01b342;XSLTTransformer:fb021431;XSLTTransformer:fb02542e;
HTMLSerializer:1"

This string can than simply be used for comparissions etc.

The string approach was used as the algorithm does not know how many
components the key will have beforehand. So the array for holding all 
the components has to be build dynamically, which decreases performance
a little bit.

Of course, I see your point, that comparing a (hashed) long value is much
faster than comparing strings. 

But to build the hash value a lot of string operations are used anyway
and the hashing takes time. And all this has to be done to compare two 
keys.
So is it really still faster than simply concatenating strings and
comparing them?

I really don't know - perhaps we could write a performance test case
and see what's faster.

Carsten

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, email: [EMAIL PROTECTED]

RE: [RT]: Calculating the cache key

Reply via email to