Martin,
> You are, of course, correct.

Well yes, thank you :), but it also helps when good, qualified people go 
over one's assertions to make sure they're not wrong.

> --------------
>
> First, the identity-based collections.
>
> The current IdentitySet and IdentityDictionary implementations do adjust 
> the hash function for large table sizes, with this code in #scanFor:
>
>    finish > 4096
>      ifTrue: [hash := anObject identityHash * (finish // 4096)]
>      ifFalse: [hash := anObject identityHash].
>    start := (hash \\ finish) + 1.
>
> This makes IdentityDictionary pretty much as fast as it can be at most 
> sizes, but there's a problem here. Take a look at the HashTable 
> comparison graph:
>
> http://img247.echo.cx/img247/3702/chart8vv.png
>
> IdentityDictionary is fast at most sizes, but has problems right around 
> 4K of size. I'm going to assert (without testing, always dangerous in 
> performance work!) that this is because the hash spreading algorithm 
> above but always leaves an area at the end of the table array that never 
> gets any hash values mapped to it. This will be worst when the table 
> size is between 4K and 8K, when the multiplier is still 1. When the 
> number of objects in the dictionary gets high enough that the table 
> grows again, the hashes start being spread out and it gets fast again.
>   

At first sight, it seems to me that this should be fixed along the lines 
of what you propose.

> I'd suggest trying replacing the above code with something like this:
>
>    hash := anObject identityHash * (finish // 4096 + 1)
>    start := (hash \\ finish) + 1.
>
> Not only better performing (I expect) but simpler. Possibly even 
> better-performing might be this:
>
>    hash := anObject identityHash * <well-chosen constant>
>    start := (hash \\ finish) + 1.
>
> where the constant is large enough to get a spread across most of the 
> positive SmallInteger range, but not large enough to exceed it. Andres, 
> what's a good choice for the constant? Maybe a prime that is never close 
> to a prime used for a table size in these collections?
>   

Actually, since Squeak/Pharo have hashMultiply, I'd guess that constant 
(1664525, IIRC) should suffice for the vast majority of cases (1664525 * 
4095 > 1 billion, that should do).  Hashed collection sizes (past the 
first trivial tiny sizes) should be chosen carefully.  Knuth has some 
suggestions for how to select good prime sizes.  In fact, 1664525 should 
be seen as one potential hashed collection size too (but never used as 
such).  The hash book has all the details on this.

> I'd suggest using a hash-spreading function on the equality-based 
> collections as well.

Ideally, I'd do something along the lines of

Object>>scaledIdentityHash

  ^self identityHash bitShift: 18  "I think, basically as much as 
possible without creating a large integer --- note that multiplying by a 
power of two has the effect of a permutation modulo a well chosen prime 
because gcd(2^k, wellChosenPrime) = 1 and so 2^k is inversible mod 
wellChosenPrime"


implement

Object>>hash

  ^self scaledIdentityHash  "as opposed to identityHash"


and then substitute sending scaledIdentityHash for identityHash in the 
identity based collections.  This has several benefits.

* All the scaling code in hashed collections, both identity and equality 
if they scale values, can be thrown away.

* The index for an object becomes object hash "or scaledIdentityHash" \\ 
tableSize + 1 in all cases.  That's so little work the whole method in 
which it is implemented can be inlined, removing another message send 
and making collections even more efficient.

* Other senders of identityHash are not affected, particularly those 
that assume identityHash has a certain number of bits and so on.


Finally, I am assuming that chaining means that each slot in the hash 
table has a pointer that says where the collision chain continues (or a 
node that points to the next node in the collision).  It may be 
worthwhile to entertain using hash buckets, where the whole chain is in 
an array, particularly for identity collections if there's a primitive 
that scans for an object in an array.  That structure is also useful to 
implement the symbol table (and all sorts of other tables, actually).

As Martin says, mod special cases where you know you will have a lot of 
collisions, open addressing linear probing is the way to go.  And, for 
special cases such as that of identityHash, it may be worthwhile 
considering comparing using == but implementing identityHash in terms of 
hash.  After all, certainly if two objects are == they should have the 
same hash (and identityHash) values.

Andres.

_______________________________________________
Pharo-project mailing list
[email protected]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project

Reply via email to