> You are, of course, correct. I agree that the place to focus on is the
> hash functions being used. And I believe that there are some simple
> improvements that can be made.
>
> From a (very!) brief look, this HashTable implementation appears to be
> inspired by difficulties caused by having only 4096 identity hash
> values in Squeak/Pharo. If you have a large collection using identity
> hash you will, by definition, have lots of collisions.

Yes! And lookups get linearly boring and slow.
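
A quick workspace check of that claim (assuming an image where the
identity hash really is limited to 4096 values, as described above):

  "Count the distinct identity hashes among 10000 fresh objects."
  ((1 to: 10000) collect: [:i | Object new identityHash]) asSet size
    "=> at most 4096, so a 10000-element identity collection must collide"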

> The *best* (and most difficult) solution is to increase the number of
> identity hash values that can be stored in the object header. Until
> that can be done, we adjust as best we can...
>
> --------------
>
> First, the identity-based collections.
>
> The current IdentitySet and IdentityDictionary implementations do
> adjust the hash function for large table sizes, with this code in
> #scanFor:
>
>   finish > 4096
>     ifTrue: [hash := anObject identityHash * (finish // 4096)]
>     ifFalse: [hash := anObject identityHash].
>   start := (hash \\ finish) + 1.
>
> This makes IdentityDictionary pretty much as fast as it can be at most
> sizes, but there's a problem here. Take a look at the HashTable
> comparison graph:
>
> http://img247.echo.cx/img247/3702/chart8vv.png
>
> IdentityDictionary is fast at most sizes, but has problems right around
> the 4K mark. I'm going to assert (without testing, always dangerous in
> performance work!) that this is because the hash spreading algorithm
> above always leaves an area at the end of the table array that never
> gets any hash values mapped to it. This will be worst when the table
> size is between 4K and 8K, when the multiplier is still 1: start
> positions can then only fall in slots 1 through 4096, so the tail of
> the table fills only by linear probing past collisions. When the number
> of objects in the dictionary gets high enough that the table grows
> again, the hashes start being spread out and it gets fast again.
>
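That assertion is easy to sanity-check in a workspace. A sketch, using
5003 (a made-up prime table size between 4K and 8K) as the value of
finish:

  | finish starts |
  finish := 5003.
  starts := ((0 to: 4095) collect:
    [:h | ((h * (finish // 4096)) \\ finish) + 1]) asSet.
  starts max
    "=> 4096: slots 4097..5003 are never a start position"
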
> I'd suggest trying to replace the above code with something like this:
>
>   hash := anObject identityHash * (finish // 4096 + 1).
>   start := (hash \\ finish) + 1.
>
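The same check with the +1 multiplier, again with the made-up size 5003:

  | finish starts |
  finish := 5003.
  starts := ((0 to: 4095) collect:
    [:h | ((h * (finish // 4096 + 1)) \\ finish) + 1]) asSet.
  starts max
    "=> 5003: start positions now reach the end of the table"
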
> Not only better performing (I expect) but simpler. Possibly even
> better-performing might be this:
>
>   hash := anObject identityHash * <well-chosen constant>.
>   start := (hash \\ finish) + 1.
>
> where the constant is large enough to get a spread across most of the
> positive SmallInteger range, but not large enough to exceed it. Andres,
> what's a good choice for the constant? Maybe a prime that is never
> close to a prime used for a table size in these collections?
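
I don't know what Andres would answer, but the constraint is easy to
state: on a 32-bit image the product must stay below SmallInteger
maxVal (2^30 - 1), so with 12-bit identity hashes the multiplier can be
at most SmallInteger maxVal // 4095. A sketch that hunts for the
largest prime under that bound:

  "Largest prime p such that 4095 * p is still a SmallInteger."
  | p |
  p := SmallInteger maxVal // 4095.
  [p isPrime] whileFalse: [p := p - 1].
  p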
>
>
> ---------------
>
> As the graph shows, Dictionary has a real problem when grown above
> 4000 entries, and I would expect Set to have a similar problem. I
> assume this is with objects that do not define a well-spread hash
> function, such as objects that implement hash as identity hash. And
> this is quite a few objects.
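
A rough way to reproduce that diagnosis (timings will vary by machine;
#timeToRun answers the elapsed milliseconds in images of this vintage):

  "Grow a Set of plain Objects (default #hash) past the 4096 mark."
  | s |
  s := Set new.
  [1 to: 8000 do: [:i | s add: Object new]] timeToRun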
>
> I'd suggest using a hash-spreading function on the equality-based
> collections as well. Something like that used for IdentitySet *could*
> work, but if you give that formula objects whose hashes *are* already
> well-spread across the entire SmallInteger range, on most entries
> you'd be creating short-lived large integers, and if you're looking
> for performance you don't want to do that. Using #hashMultiply would
> probably improve things quite a bit, though it might be possible to
> do a bit better than that.
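
For concreteness, the equality-based #scanFor: hash line might then
look something like this (a sketch; #hashMultiply mixes a hash into a
well-spread 28-bit SmallInteger without creating large integers):

  hash := anObject hash hashMultiply.
  start := (hash \\ finish) + 1.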
>
> With changes along these lines, I agree with Andres that
> linear-probing collections should be able to equal or beat the
> performance of chained collections, and are a better choice for
> general-purpose collections.
>
> Regards,
>
> -Martin

Martin,
don't you have some changes that we could try? :)

Stef
