> You are, of course, correct. I agree that the place to focus on is the
> hash functions being used. And I believe that there are some simple
> improvements that can be made.
>
> From a (very!) brief look, this HashTable implementation appears to be
> inspired by difficulties caused by having only 4096 identity hash values
> in Squeak/Pharo. If you have a large collection using identity hash you
> will, by definition, have lots of collisions.

Yes, and they get linearly boring and slow.

> The *best* (and most difficult) solution is to increase the number of
> identity hash values that can be stored in the object header. Until that
> can be done, we adjust as best we can...
>
> --------------
>
> First, the identity-based collections.
>
> The current IdentitySet and IdentityDictionary implementations do adjust
> the hash function for large table sizes, with this code in #scanFor:
>
>     finish > 4096
>         ifTrue: [hash := anObject identityHash * (finish // 4096)]
>         ifFalse: [hash := anObject identityHash].
>     start := (hash \\ finish) + 1.
>
> This makes IdentityDictionary pretty much as fast as it can be at most
> sizes, but there's a problem here. Take a look at the HashTable
> comparison graph:
>
> http://img247.echo.cx/img247/3702/chart8vv.png
>
> IdentityDictionary is fast at most sizes, but has problems right around
> 4K of size. I'm going to assert (without testing, always dangerous in
> performance work!) that this is because the hash-spreading algorithm
> above always leaves an area at the end of the table array that never
> gets any hash values mapped to it. This will be worst when the table
> size is between 4K and 8K, when the multiplier is still 1. When the
> number of objects in the dictionary gets high enough that the table
> grows again, the hashes start being spread out and it gets fast again.
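The asserted dead zone is easy to check in a workspace. A minimal sketch,
assuming an illustrative table size of 5000 (any size from 4097 through
8191 behaves the same way, since the multiplier is 1 throughout that
range):

    | finish multiplier starts |
    finish := 5000.    "assumed example table size between 4K and 8K"
    multiplier := finish // 4096.    "= 1, so identity hashes get no spread"
    starts := (0 to: 4095) collect: [:h | (h * multiplier \\ finish) + 1].
    starts asSet size    "=> 4096: slots 4097 through 5000 are never probe starts"

With the multiplier stuck at 1, every entry competes for the first 4096
slots while the tail of the table stays empty, which is consistent with
the slowdown right around 4K visible in the graph.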
> I'd suggest replacing the #scanFor: code above with something like this:
>
>     hash := anObject identityHash * (finish // 4096 + 1).
>     start := (hash \\ finish) + 1.
>
> Not only better performing (I expect) but simpler. Possibly even
> better-performing would be something like this:
>
>     hash := anObject identityHash * <well-chosen constant>.
>     start := (hash \\ finish) + 1.
>
> where the constant is large enough to get a spread across most of the
> positive SmallInteger range, but not large enough to exceed it. Andres,
> what's a good choice for the constant? Maybe a prime that is never close
> to a prime used for a table size in these collections?
>
> ---------------
>
> As the graph shows, Dictionary has a real problem when grown above 4000
> entries, and I would expect Set to have a similar problem. I assume this
> is with objects that do not define a well-spread hash function, such as
> objects that implement #hash as identity hash, and that is quite a few
> objects.
>
> I'd suggest using a hash-spreading function on the equality-based
> collections as well. Something like that used for IdentitySet *could*
> work, but if you give that formula objects whose hashes *are* already
> well spread across the entire SmallInteger range, on most entries you'd
> be creating short-lived LargeIntegers, and if you're looking for
> performance you don't want to do that. Using #hashMultiply would
> probably improve things quite a bit, though it might be possible to do
> a bit better than that.
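A minimal workspace sketch of the #hashMultiply idea; the keys and table
size here are arbitrary illustrative choices, not values from the thread.
Small integers hash to themselves, so the raw probe starts land in one
dense cluster, while the scrambled starts scatter across the table
without the LargeInteger churn mentioned above:

    | finish keys rawStarts spreadStarts |
    finish := 5000.    "assumed example table size"
    keys := (1 to: 16) asArray.    "consecutive SmallIntegers; #hash answers the value itself"
    rawStarts := keys collect: [:k | (k hash \\ finish) + 1].
    spreadStarts := keys collect: [:k | (k hash hashMultiply \\ finish) + 1].
    { rawStarts. spreadStarts }    "starts 2..17 versus well-scattered starts"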
> With changes along these lines, I agree with Andres that linear-probing
> collections should be able to equal or beat the performance of chained
> collections, and are a better choice for general-purpose collections.
>
> Regards,
>
> -Martin

Martin, don't you have some changes so that we could try? :)

Stef

_______________________________________________
Pharo-project mailing list
[email protected]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project