I like that idea...

> ----------
> From: Geoff Howard [SMTP:[EMAIL PROTECTED]]
> Reply To: [EMAIL PROTECTED]
> Sent: Tuesday, May 21, 2002 11:10 AM
> To: '[EMAIL PROTECTED]'
> Subject: RE: [RT]: Calculating the cache key
>
> The performance of key comparison is especially important for high-traffic
> sites (like ours). An idea I've had to reduce this cost:
>
> An *option* to have the pipeline do asynchronous cache validation: if a
> cached item exists, use it *then* check its validity (or cause it to be
> checked), and remove it from the store if it has expired. This is
> attractive to us because we'll have many steps that need to be checked
> (many of our pages will involve five, six, or more generators, aggregated
> and included), and the performance hit of generating and comparing all
> those keys is scary. This integrates well with the external cache validity
> concept that Marcelo Ochoa/DBPrism has been working on.
>
> Geoff Howard
>
> -----Original Message-----
> From: Carsten Ziegeler [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, May 21, 2002 9:16 AM
> To: [EMAIL PROTECTED]
> Subject: RE: [RT]: Calculating the cache key
>
>
> Berin Loritsch wrote:
> >
> > <snip/>
> >
> > I'm wondering about the whole hash algorithm. Concatenating
> > strings can get expensive, especially if it is done in multiple
> > sections multiple times. The fact is that 32-bit hash values
> > provide a virtually limitless number of possibilities for cache
> > values in a site. With 2^32 (over 4 billion) unique entries,
> > you are not likely to ever need to repeat.
> >
> > Here is the algorithm for a string hash:
> >
> > s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
> >
> > Your possible hash values are limited by the values that can
> > exist in a string. The shorter your string, the fewer possible
> > values. They use 31 as a multiplier only so that the hash
> > values spread better (multiples of 2 are even more limiting).
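[The polynomial quoted above is exactly the hash used by `java.lang.String.hashCode()`. A minimal reimplementation, as a sketch, makes the cost argument concrete: the loop visits every character, so a 1000-character key costs roughly 25x as much to hash as the 39-character URL discussed below.

```java
public class StringHashDemo {
    // Reimplementation of java.lang.String.hashCode():
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);  // one multiply-add per character
        }
        return h;
    }

    public static void main(String[] args) {
        String url = "http://localhost:8080/cocoon/hello.html";
        // Matches the JDK's own result, and the cost is proportional
        // to the string length.
        System.out.println(hash(url) == url.hashCode()); // true
    }
}
```
]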
> >
> > Using the formula, you can translate this:
> >
> > "http://localhost:8080/cocoon/hello.html"
> >
> > into this:
> >
> > 'h'*31^(39-1) + 't'*31^(39-2) + ... + 'l'
> >
> > I'll let you do the math yourself.
>
> Yes, but what if my strings are, let's say, 1000 characters long?
> Looking at the code, for example in the XSLT transformer and
> some other places, this is a realistic value.
>
> > It also means that computing the hash key takes time proportional
> > to the length of the string. What makes it worse is that the unique
> > part of the string comes only after the "http://localhost:8080/cocoon/"
> > section. We are doing more work for less gain.
> >
> > All a hashCode is supposed to buy us is a quick way to get to the
> > information we need: a way to get us to the "bucket", if you will.
> > If there are collisions, we need to iterate through the entries
> > in the bucket until we find our match. This takes even longer when
> > we are dealing with strings.
> >
> > What we need is either a way of generating a match based on a GUID
> > (Globally Unique IDentifier), or a compound key made up of
> > a group of ids generated from each component. The compound key
> > approach has a drawback: in order to match a key, you have to
> > iterate through all the parts of the key. However, you can easily
> > generate a hash value from the parts, and there are fewer parts.
> >
> > Consider a compound key that has an int for each part. That integer
> > is made up of a top half marking the component ID (there will be fewer
> > than 65,536 unique types of components) and the unique entry within
> > that component (another 65,536 entries). We can shift the balance
> > as necessary. A compound key has one part for each component that is
> > part of the pipeline.
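[The half-and-half int layout described above could be sketched as follows. The packing scheme and the method names are hypothetical illustrations of the idea, not Cocoon's actual code:

```java
public class CompoundKeyPart {
    // Hypothetical packing: top 16 bits = component type ID,
    // bottom 16 bits = unique entry within that component.
    static int pack(int componentId, int entryId) {
        return (componentId << 16) | (entryId & 0xFFFF);
    }

    static int componentId(int part) { return part >>> 16; }
    static int entryId(int part)     { return part & 0xFFFF; }

    public static void main(String[] args) {
        int part = pack(0x00FE, 0x1B34);
        // Comparing two such parts is a single int comparison,
        // independent of how long the component's resource name is.
        System.out.println(Integer.toHexString(part)); // fe1b34
    }
}
```
]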
> > Something like this:
> >
> > p[0]=fe01b342 { FileGenerator, resource="hello.xml" }
> > p[1]=fb021431 { XSLTTransformer, resource="layout.xsl" }
> > p[2]=fb02542e { XSLTTransformer, resource="skin.xsl" }
> > p[3]=fc010000 { HtmlSerializer }
> >
> > This is a compound key with only four entries (much quicker to iterate
> > through than a really long string). The hash algorithm is an exercise
> > I'll leave to people far more intelligent than I am.
> > Basically, it should be driven by the four entries.
> >
> > Perhaps something along these lines:
> >
> > s[0]%n*2^(32/n) + s[1]%n*2^(32/n) + ... + s[n-1]%n*2^(32/n)
> >
> > I'm no genius, but I think it is better than working with strings if
> > we really don't have to.
>
> The current caching algorithm uses this compound key, with the exception
> that the longs are used to build a string:
>
> "FileGenerator:fe01b342;XSLTTransformer:fb021431;XSLTTransformer:fb02542e;HTMLSerializer:1"
>
> This string can then simply be used for comparisons etc.
>
> The string approach was used because the algorithm does not know
> beforehand how many components the key will have. So the array holding
> all the components would have to be built dynamically, which decreases
> performance a little bit.
>
> Of course, I see your point that comparing a (hashed) long value is much
> faster than comparing strings.
>
> But to build the hash value a lot of string operations are used anyway,
> and the hashing takes time. And all this has to be done to compare two
> keys. So is it really still faster than simply concatenating strings and
> comparing them?
>
> I really don't know; perhaps we could write a performance test case
> and see which is faster.
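[A rough version of the performance test proposed above might look like the sketch below. The key values are taken from Berin's example; the iteration count and timing approach are illustrative only, and a real micro-benchmark would need warmup and a harness:

```java
import java.util.Arrays;

public class KeyCompareBench {
    public static void main(String[] args) {
        String a = "FileGenerator:fe01b342;XSLTTransformer:fb021431;"
                 + "XSLTTransformer:fb02542e;HTMLSerializer:1";
        String b = new String(a);  // distinct object forces a char-by-char compare
        int[] p = { 0xfe01b342, 0xfb021431, 0xfb02542e, 0xfc010000 };
        int[] q = p.clone();

        final int N = 1_000_000;
        boolean eq = true;
        long t0 = System.nanoTime();
        for (int i = 0; i < N; i++) eq &= a.equals(b);       // string key compare
        long t1 = System.nanoTime();
        for (int i = 0; i < N; i++) eq &= Arrays.equals(p, q); // compound key compare
        long t2 = System.nanoTime();

        System.out.println("String.equals : " + (t1 - t0) / N + " ns/op");
        System.out.println("Arrays.equals : " + (t2 - t1) / N + " ns/op");
        System.out.println(eq); // true -- both comparisons always matched
    }
}
```
]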
>
> Carsten
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, email: [EMAIL PROTECTED]