Andres Valloud wrote:
> In general terms, open addressing with linear probing should tie or beat 
> chaining.  For this to happen, the hash function used must be of high 
> quality.  If chaining is beating open addressing with linear probing, 
> I'd strongly suspect that the hash function used is producing 
> significant collisions or unevenly distributed values for the data set 
> in question.  You may want to take a look at the Hash Analysis Tool I 
> wrote for VW (it's in the public Store repository under a MIT license).  
> For a more in-depth treatment of these matters, you may also want to 
> look at the hash book I wrote (www.lulu.com/avSmalltalkBooks), and its 
> matching Smalltalk Solutions 2008 presentation 
> (http://video.google.com/videoplay?docid=-7510297003494499688&hl=en#).

You are, of course, correct. I agree that the place to focus on is the 
hash functions being used. And I believe that there are some simple 
improvements that can be made.

From a (very!) brief look, this HashTable implementation appears to be 
motivated by the difficulties caused by having only 4096 identity hash 
values in Squeak/Pharo. If you have a collection of many more than 4096 
elements keyed by identity hash, the pigeonhole principle guarantees 
lots of collisions.

The *best* (and most difficult) solution is to increase the number of 
identity hash values that can be stored in the object header. Until that 
can be done, we adjust as best we can...

--------------

First, the identity-based collections.

The current IdentitySet and IdentityDictionary implementations do adjust 
the hash function for large table sizes, with this code in #scanFor:

   "Spread identityHash values over tables larger than the
    4096 possible identityHash values:"
   finish > 4096
     ifTrue: [hash := anObject identityHash * (finish // 4096)]
     ifFalse: [hash := anObject identityHash].
   start := (hash \\ finish) + 1.

This makes IdentityDictionary pretty much as fast as it can be at most 
sizes, but there's a problem here. Take a look at the HashTable 
comparison graph:

http://img247.echo.cx/img247/3702/chart8vv.png

IdentityDictionary is fast at most sizes, but has problems right around 
4K entries. I'm going to assert (without testing, always dangerous in 
performance work!) that this is because the hash spreading algorithm 
above always leaves an area at the end of the table array that never 
gets any hash values mapped to it. This will be worst when the table 
size is between 4K and 8K, when the multiplier is still 1 and only the 
first 4096 slots ever receive an initial probe. When the number of 
objects in the dictionary gets high enough that the table grows again, 
the hashes start being spread out and it gets fast again.
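
To make the dead zone concrete, here's a quick workspace check (a 
sketch assuming, per the above, that identityHash yields one of 4096 
values, 0..4095):

   "Take a table size just above 4K, say finish = 5000.
    Then finish // 4096 = 1, so the multiplier does nothing:"
   finish := 5000.
   maxHash := 4095.   "the largest of the 4096 identityHash values"
   (maxHash * (finish // 4096) \\ finish) + 1.   "=> 4096"
   "Slot 4096 is the highest slot that ever receives an initial probe;
    slots 4097..5000 are only reached by walking collision chains, and
    every entry crowds into the first 4096 slots."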

I'd suggest replacing the above code with something like this:

   hash := anObject identityHash * (finish // 4096 + 1).
   start := (hash \\ finish) + 1.
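
A quick sanity check of the new formula, using a hypothetical prime 
table size of 5003:

   "With finish = 5003 the multiplier is 5003 // 4096 + 1 = 2, so the
    4096 identityHash values map to 0, 2, 4, ... 8190, and the modulus
    wraps the top half onto the odd-numbered slots:"
   ((0 to: 4095) collect: [:h | (h * 2 \\ 5003) + 1]) asSet size.
   "=> 4096 distinct slots, spread across the whole table"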

Not only better performing (I expect) but simpler. Possibly even 
better-performing might be this:

   hash := anObject identityHash * <well-chosen constant>.
   start := (hash \\ finish) + 1.

where the constant is large enough to get a spread across most of the 
positive SmallInteger range, but not large enough to exceed it. Andres, 
what's a good choice for the constant? Maybe a prime that is never close 
to a prime used for a table size in these collections?
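
For what it's worth, the upper bound on such a constant is easy to work 
out. A minimal sketch, assuming the 32-bit Squeak/Pharo SmallInteger 
range (SmallInteger maxVal = 2^30 - 1) and 4096 identityHash values:

   "The largest multiplier that keeps identityHash * constant inside
    SmallInteger range, avoiding short-lived LargeIntegers:"
   SmallInteger maxVal // 4095.   "=> 262208"

So the constant would need to be a prime at or below 262208.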


---------------

As the graph shows, Dictionary has a real problem when grown above 4000 
entries, and I would expect Set to have a similar problem. I assume this 
is with objects that do not define a well-spread hash function, such as 
objects whose #hash defaults to identity hash. And that describes quite 
a few classes.

I'd suggest using a hash-spreading function on the equality-based 
collections as well. Something like the formula used for IdentitySet 
*could* work, but feed it objects whose hashes *are* already well 
spread across the entire SmallInteger range and you'd be creating 
short-lived LargeIntegers on most entries; if you're looking for 
performance, you don't want to do that. Using #hashMultiply would 
probably improve things quite a bit, though it might be possible to do 
a bit better than that.
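
For concreteness, a minimal sketch of what the equality-based #scanFor: 
could start with, using Squeak's existing #hashMultiply (which 
multiplies by 1664525 and truncates to 28 bits, so the result always 
stays a SmallInteger):

   "Spread the equality hash before taking the modulus:"
   hash := anObject hash hashMultiply.
   start := (hash \\ finish) + 1.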

With changes along these lines, I agree with Andres that linear-probing 
collections should be able to equal or beat the performance of chained 
collections, and are a better choice for general-purpose collections.

Regards,

-Martin
