Martin, > You are, of course, correct. Well yes, thank you :), but it also helps when good, qualified people go over one's assertions to make sure they're not wrong.
> -------------- > > First, the identity-based collections. > > The current IdentitySet and IdentityDictionary implementations do adjust > the hash function for large table sizes, with this code in #scanFor: > > finish > 4096 > ifTrue: [hash := anObject identityHash * (finish // 4096)] > ifFalse: [hash := anObject identityHash]. > start := (hash \\ finish) + 1. > > This makes IdentityDictionary pretty much as fast as it can be at most > sizes, but there's a problem here. Take a look at the HashTable > comparison graph: > > http://img247.echo.cx/img247/3702/chart8vv.png > > IdentityDictionary is fast at most sizes, but has problems right around > 4K of size. I'm going to assert (without testing, always dangerous in > performance work!) that this is because the hash spreading algorithm > above but always leaves an area at the end of the table array that never > gets any hash values mapped to it. This will be worst when the table > size is between 4K and 8K, when the multiplier is still 1. When the > number of objects in the dictionary gets high enough that the table > grows again, the hashes start being spread out and it gets fast again. > At first sight, it seems to me that this should be fixed along the lines of what you propose. > I'd suggest trying replacing the above code with something like this: > > hash := anObject identityHash * (finish // 4096 + 1) > start := (hash \\ finish) + 1. > > Not only better performing (I expect) but simpler. Possibly even > better-performing might be this: > > hash := anObject identityHash * <well-chosen constant> > start := (hash \\ finish) + 1. > > where the constant is large enough to get a spread across most of the > positive SmallInteger range, but not large enough to exceed it. Andres, > what's a good choice for the constant? Maybe a prime that is never close > to a prime used for a table size in these collections? > Actually, since Squeak/Pharo have hashMultiply, I'd guess that constant (1664525, IIRC) should suffice for the vast majority of cases (1664525 * 4095 > 1 billion, that should do). Hashed collection sizes (past the first trivial tiny sizes) should be chosen carefully. Knuth has some suggestions for how to select good prime sizes. In fact, 1664525 should be seen as one potential hashed collection size too (but never used as such). The hash book has all the details on this. > I'd suggest using a hash-spreading function on the equality-based > collections as well. Ideally, I'd do something along the lines of Object>>scaledIdentityHash ^self identityHash bitShift: 18 "I think, basically as much as possible without creating a large integer --- note that multiplying by a power of two has the effect of a permutation modulo a well chosen prime because gcd(2^k, wellChosenPrime) = 1 and so 2^k is inversible mod wellChosenPrime" implement Object>>hash ^self scaledIdentityHash "as opposed to identityHash" and then substitute sending scaledIdentityHash for identityHash in the identity based collections. This has several benefits. * All the scaling code in hashed collections, both identity and equality if they scale values, can be thrown away. * The index for an object becomes object hash "or scaledIdentityHash" \\ tableSize + 1 in all cases. That's so little work the whole method in which it is implemented can be inlined, removing another message send and making collections even more efficient. * Other senders of identityHash are not affected, particularly those that assume identityHash has a certain number of bits and so on. Finally, I am assuming that chaining means that each slot in the hash table has a pointer that says where the collision chain continues (or a node that points to the next node in the collision). It may be worthwhile to entertain using hash buckets, where the whole chain is in an array, particularly for identity collections if there's a primitive that scans for an object in an array. That structure is also useful to implement the symbol table (and all sorts of other tables, actually). As Martin says, mod special cases where you know you will have a lot of collisions, open addressing linear probing is the way to go. And, for special cases such as that of identityHash, it may be worthwhile considering comparing using == but implementing identityHash in terms of hash. After all, certainly if two objects are == they should have the same hash (and identityHash) values. Andres. _______________________________________________ Pharo-project mailing list [email protected] http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
