I like that idea...

> ----------
> From: Geoff Howard [SMTP:[EMAIL PROTECTED]]
> Reply To: [EMAIL PROTECTED]
> Sent: Tuesday, May 21, 2002 11:10 AM
> To: '[EMAIL PROTECTED]'
> Subject: RE: [RT]: Calculating the cache key
>
> The performance of key comparison is especially important for high-traffic
> sites (like ours). An idea I've had to reduce this cost:
>
> An *option* to have the pipeline do asynchronous cache validation: if a
> cached item exists, use it *then* check its validity (or cause it to be
> checked), and remove it from the store if it has expired. This is
> attractive to us because we'll have many steps that need to be checked
> (many of our pages will involve five, six, or more generators, aggregated
> and included), and the performance hit of generating and comparing all
> those keys is scary. This integrates well with the external cache validity
> concept that Marcelo Ochoa/DBPrism has been working on.
>
> Geoff Howard
>
> -----Original Message-----
> From: Carsten Ziegeler [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, May 21, 2002 9:16 AM
> To: [EMAIL PROTECTED]
> Subject: RE: [RT]: Calculating the cache key
>
>
> Berin Loritsch wrote:
> >
> > <snip/>
> >
> > I'm wondering about the whole hash algorithm. Concatenating
> > strings can get expensive, especially if it is done in multiple
> > sections multiple times. The fact is that 32-bit hash values
> > provide a virtually limitless number of possibilities for cache
> > values in a site. With 2^32 (over 4 billion) unique entries,
> > you are not likely to ever need to repeat.
> >
> > Here is the algorithm for a string hash:
> >
> > s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
> >
> > Your possible hash values are limited by the values that can
> > exist in a string. The shorter your string, the fewer possible
> > values. They use 31 as a multiplier only so that the hash
> > values spread better (multiples of 2 are even more limiting).
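[The polynomial quoted above is exactly the hash used by `java.lang.String.hashCode()`. A minimal reimplementation, as a sketch, makes the cost argument concrete: the loop visits every character, so a 1000-character key costs roughly 25x as much to hash as the 39-character URL discussed below.

```java
public class StringHashDemo {
    // Reimplementation of java.lang.String.hashCode():
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);  // one multiply-add per character
        }
        return h;
    }

    public static void main(String[] args) {
        String url = "http://localhost:8080/cocoon/hello.html";
        // Matches the JDK's own result, and the cost is proportional
        // to the string length.
        System.out.println(hash(url) == url.hashCode()); // true
    }
}
```
]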
> >
> > Using the formula, you can translate this:
> >
> > "http://localhost:8080/cocoon/hello.html"
> >
> > into this:
> >
> > 'h'*31^(39-1) + 't'*31^(39-2) + ... + 'l'
> >
> > I'll let you do the math yourself.
>
> Yes, but what if my strings are, let's say, 1000 characters long?
> Looking at the code, for example in the XSLT transformer and
> some other places, this is a realistic value.
>
> > It also means that computing the hash key takes time proportional
> > to the length of the string. What makes it worse is that the unique
> > part of the string comes only after the "http://localhost:8080/cocoon/"
> > section. We are doing more work for less gain.
> >
> > All a hashCode is supposed to buy us is a quick way to get to the
> > information we need: a way to get us to the "bucket", if you will.
> > If there are collisions, we need to iterate through the entries
> > in the bucket until we find our match. This takes even longer when
> > we are dealing with strings.
> >
> > What we need is either a way of generating a match based on a GUID
> > (Globally Unique IDentifier), or a compound key made up of
> > a group of ids generated from each component. The compound key
> > approach has a drawback: in order to match a key, you have to
> > iterate through all the parts of the key. However, you can easily
> > generate a hash value from the parts, and there are fewer parts.
> >
> > Consider a compound key that has an int for each part. That integer
> > is made up of a top half marking the component ID (there will be fewer
> > than 65,536 unique types of components) and the unique entry within
> > that component (another 65,536 entries). We can shift the balance
> > as necessary. A compound key has one part for each component that is
> > part of the pipeline.
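[The half-and-half int layout described above could be sketched as follows. The packing scheme and the method names are hypothetical illustrations of the idea, not Cocoon's actual code:

```java
public class CompoundKeyPart {
    // Hypothetical packing: top 16 bits = component type ID,
    // bottom 16 bits = unique entry within that component.
    static int pack(int componentId, int entryId) {
        return (componentId << 16) | (entryId & 0xFFFF);
    }

    static int componentId(int part) { return part >>> 16; }
    static int entryId(int part)     { return part & 0xFFFF; }

    public static void main(String[] args) {
        int part = pack(0x00FE, 0x1B34);
        // Comparing two such parts is a single int comparison,
        // independent of how long the component's resource name is.
        System.out.println(Integer.toHexString(part)); // fe1b34
    }
}
```
]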
> > Something like this:
> >
> > p[0]=fe01b342 { FileGenerator, resource="hello.xml" }
> > p[1]=fb021431 { XSLTTransformer, resource="layout.xsl" }
> > p[2]=fb02542e { XSLTTransformer, resource="skin.xsl" }
> > p[3]=fc010000 { HtmlSerializer }
> >
> > This is a compound key with only four entries (much quicker to iterate
> > through than a really long string). The hash algorithm is an exercise
> > I'll leave to people far more intelligent than I am.
> > Basically, it should be driven by the four entries.
> >
> > Perhaps something along these lines:
> >
> > s[0]%n*2^(32/n) + s[1]%n*2^(32/n) + ... + s[n-1]%n*2^(32/n)
> >
> > I'm no genius, but I think it is better than working with strings if
> > we really don't have to.
>
> The current caching algorithm uses this compound key, with the exception
> that the longs are used to build a string:
>
> "FileGenerator:fe01b342;XSLTTransformer:fb021431;XSLTTransformer:fb02542e;HTMLSerializer:1"
>
> This string can then simply be used for comparisons etc.
>
> The string approach was used because the algorithm does not know
> beforehand how many components the key will have. So the array holding
> all the components would have to be built dynamically, which decreases
> performance a little bit.
>
> Of course, I see your point that comparing a (hashed) long value is much
> faster than comparing strings.
>
> But to build the hash value a lot of string operations are used anyway,
> and the hashing takes time. And all this has to be done to compare two
> keys. So is it really still faster than simply concatenating strings and
> comparing them?
>
> I really don't know; perhaps we could write a performance test case
> and see which is faster.
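[A rough version of the performance test proposed above might look like the sketch below. The key values are taken from Berin's example; the iteration count and timing approach are illustrative only, and a real micro-benchmark would need warmup and a harness:

```java
import java.util.Arrays;

public class KeyCompareBench {
    public static void main(String[] args) {
        String a = "FileGenerator:fe01b342;XSLTTransformer:fb021431;"
                 + "XSLTTransformer:fb02542e;HTMLSerializer:1";
        String b = new String(a);  // distinct object forces a char-by-char compare
        int[] p = { 0xfe01b342, 0xfb021431, 0xfb02542e, 0xfc010000 };
        int[] q = p.clone();

        final int N = 1_000_000;
        boolean eq = true;
        long t0 = System.nanoTime();
        for (int i = 0; i < N; i++) eq &= a.equals(b);       // string key compare
        long t1 = System.nanoTime();
        for (int i = 0; i < N; i++) eq &= Arrays.equals(p, q); // compound key compare
        long t2 = System.nanoTime();

        System.out.println("String.equals : " + (t1 - t0) / N + " ns/op");
        System.out.println("Arrays.equals : " + (t2 - t1) / N + " ns/op");
        System.out.println(eq); // true -- both comparisons always matched
    }
}
```
]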
>
> Carsten
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, email: [EMAIL PROTECTED]