Thanks for this cool mail!
On Oct 29, 2009, at 12:51 AM, Andres Valloud wrote:

> Stephane et al,
>
> Martin and I discussed this yesterday night at the PDX Smalltalk user
> group meeting. Mostly we agree, with perhaps one small exception. In
> order to make hashed collections simpler and identity hash handling
> more straightforward, hashed collections should get a small integer
> range identity hash value. Something between 0 and SmallInteger
> maxVal, for example. The issue is how to get that.
>
> Martin's approach is to change identityHash itself, and provide
> primIdentityHash to access the existing identityHash implementation
> (primitive 75). My approach is to leave identityHash as is, and
> provide a new message called scaledIdentityHash. Note that in both
> Martin's and my implementation, the new identity hash values are
> identical. In other words, whereas Martin's changes implement
>
>     identityHash
>         ^self primIdentityHash bitShift: 18
>
> my changes implement
>
>     scaledIdentityHash
>         ^self identityHash bitShift: 18
>
> So, really, this is an issue of how to implement the *same* behavior,
> with the *same* goals. Now, there are advantages and disadvantages to
> both. I'll try to explain. Martin, please feel free to correct what
> follows if I miss something.
>
> There are two kinds of senders of identityHash. Some (most) do not
> care whether the answer is well distributed or not. Typically, these
> are implementors of #hash, for example:
>
>     SomeObject>>hash
>         ^self someInstanceVariable identityHash
>
> If, on one hand, we change identityHash, then the behavior of these
> hash methods improves, because the values they answer would be
> distributed more uniformly. Nevertheless, precisely because they send
> identityHash, it could be argued that these hash methods were
> implemented assuming there would be just a few of these objects in
> hashed collections.
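[Inline note from me, not from Andres's mail: a small worked illustration of what the shift accomplishes. A 12-bit identity hash (0..4095) shifted left by 18 bits becomes a multiple of 2^18, spanning the 0..SmallInteger maxVal range of a 32-bit image instead of crowding the bottom 4096 values. A workspace sketch, assuming a 32-bit Squeak/Pharo image:]

    "Sketch (mine, not from the thread): range of the scaled identity hash.
     Assumes a 32-bit image where SmallInteger maxVal = 2^30 - 1."
    | raw scaled |
    raw := 4095.                    "largest 12-bit identity hash"
    scaled := raw bitShift: 18.     "= 4095 * 262144 = 1073479680"
    scaled <= SmallInteger maxVal   "true: still a SmallInteger (maxVal = 1073741823)"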
> If this turns out not to be the case, then performance could be much
> better by implementing a better hash method.
>
> Some (a few) other senders of identityHash assume that the answer has
> a particular distribution (in Squeak's/Pharo's case, 0 through 4095).
> Consequently, these senders may do things like this:
>
>     SomeObject>>hash
>         ^(self variableA identityHash bitShift: 12)
>             + self variableB identityHash
>
> Also, there may be hashed collections which use identityHash to index
> hash buckets (e.g., GemStone's own 32-bit object cache for
> VisualWorks, which uses identityHash to access hash buckets). So, if
> we change identityHash, these senders, which required special care to
> write, would break, requiring them to be rewritten like this:
>
>     SomeObject>>hash
>         ^(self variableA primIdentityHash bitShift: 12)
>             + self variableB primIdentityHash
>
> If, on the other hand, we provide a new scaledIdentityHash method,
> then all senders of identityHash keep behaving the same way as they
> do now. Hashed collections, which currently send identityHash, would
> need to change so that they send scaledIdentityHash instead.
> Furthermore, any implementor of hash such as
>
>     SomeObject>>hash
>         ^self someInstanceVariable identityHash
>
> would need to be rewritten as
>
>     SomeObject>>hash
>         ^self someInstanceVariable scaledIdentityHash
>
> if hashed collections are expected to contain numerous instances of
> SomeObject. If these are not rewritten, then they would continue to
> have the same performance characteristics they have today (which may
> be adequate to begin with). Clever hacks such as
>
>     SomeObject>>hash
>         ^(self variableA identityHash bitShift: 12)
>             + self variableB identityHash
>
> would also remain undisturbed. Finally, I do not know of any
> Smalltalk in which identityHash does not answer the actual object
> header bits for the identity hash.
> If we change identityHash, then AFAIK Pharo would become the only
> Smalltalk in which identityHash does not answer consecutive values
> between 0 and (2^k)-1 (k=12 for Squeak/Pharo, k=14 for VisualWorks
> 32 bits, k=20 for VisualWorks 64 bits, IIRC k=15 for VA and
> VisualSmalltalk). Moreover, AFAIK, identityHash has had this behavior
> for decades. Is it a good idea to change it? I am not so sure on that
> one. Also, I wouldn't want to break clever hacks that assume
> identityHash behaves the way it currently does, and I wouldn't want
> to make Pharo different gratuitously.
>
> So that's where we stand. I think I understand where Martin is coming
> from, and I think Martin understands where I am coming from. One way
> or the other, we also agree that something should be done to make
> hashed collections faster in Pharo.
>
> Andres.
>
>
> Stéphane Ducasse wrote:
>> Martin,
>>
>> so what you are saying is that we can gain speed, and that these
>> changes are beneficial as a simple default.
>>
>>> The test code, a variant of the test from the HashTable SqueakSource
>>> wiki, is at the bottom of this message. Basically, it adds a bunch
>>> of instances of Object to a Dictionary and sees how much time it
>>> takes to look them up again.
>>>
>>> From the graphs in the previous message, you can see that
>>> performance for sizes > 4000 is greatly improved. For size = 10000,
>>> #at: is 1000 times faster, 2-3 microseconds vs. >2 milliseconds. At
>>> size = 10000, #at:put: is about 200 times faster, ~10 microseconds
>>> vs. >2 milliseconds, and the large spikes for growing the collection
>>> are 21 milliseconds vs. >4 seconds, again a factor of about 200.
>>>
>>> Performance for dictionary sizes < 4000 is essentially the same as
>>> before, so these collections can serve as general-purpose
>>> collections over a wide range of sizes.
>>> I've attached the graphs for sizes < 4000 to this message so you
>>> can see that more clearly than on the previous graphs.
>>>
>>> These results should hold for any object that inherits #hash from
>>> Object, in other words one that uses its identity hash as its
>>> equality hash. Other objects with better hashes did not have as
>>> serious a problem, but will probably show increased performance as
>>> well due to using prime table sizes.
>>>
>>> These changes are in Set, so they should improve Set's subclasses
>>> as well. IdentityDictionary, IdentitySet, and
>>> WeakIdentityKeyDictionary did not have as serious a problem, but
>>> should see some improvement. MethodDictionaries have been left
>>> alone on the assumption that the VM depends on the hashing of those.
>>
>> So where are these changes? :)
>> Should we integrate them?
>> Andres and other hash freaks, what are your points of view?
>>
>>> Since there are still only 4K possible values of identity hash,
>>> collisions are inevitable in large collections, and the number of
>>> collisions will grow linearly with collection size. So how well
>>> does the spread hash / prime table size do at even larger sizes? I
>>> ran the same test at sizes of one million. As expected, access was
>>> quite a bit slower than it had been at 10000. Time for #at: was
>>> ~250 microseconds, and #at:put: was about the same. Note, however,
>>> that this is still ten times faster than Dictionary previously was
>>> at a size of 10000; 100 times larger, yet 10 times faster.
>>>
>>> I had a lot of fun doing this. These are better results than I
>>> expected for a fairly minor (though deep) code change.
>>
>> So where are these changes? :)
>> Should we integrate them?
>> Andres and other hash freaks, what are your points of view?
>>
>>> | test ord |
>>> Transcript cr.
>>> test := Dictionary new.
>>> [ test size >= 10000 ] whileFalse: [
>>>     ord := OrderedCollection new: 100.
>>>     Transcript show:
>>>         [ 100 timesRepeat: [
>>>             test at: (ord add: Object new) put: nil ] ]
>>>                 timeToRun asString.
>>>     Transcript tab;
>>>         show: test size asString;
>>>         tab.
>>>     Transcript show:
>>>         [ 1000 timesRepeat: [
>>>             ord do: [ :each | test at: each ] ] ]
>>>                 timeToRun asString.
>>>     Transcript tab; show:
>>>         [ 1000 timesRepeat: [
>>>             ord do: [ :each | ] ] ]
>>>                 timeToRun asString.
>>>     Transcript cr ]
>>>
>>> <SmallDictPerformanceGraphs1.png>
>>>
>>> _______________________________________________
>>> Pharo-project mailing list
>>> Pharo-project@lists.gforge.inria.fr
>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
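P.S. A sketch of mine (not from the thread) of why the spread hash matters together with the larger, prime table sizes Martin mentions: raw identity hashes index only the first 4096 buckets of a big table, while the scaled hashes reach all of it. The table size 100003 below is an arbitrary prime chosen for illustration:

    "Hypothetical sketch: bucket occupancy in a large prime-sized table."
    | tableSize raw scaled |
    tableSize := 100003.
    raw := ((0 to: 4095) collect: [ :h | h \\ tableSize ]) asSet.
    scaled := ((0 to: 4095) collect: [ :h | (h bitShift: 18) \\ tableSize ]) asSet.
    raw max.      "4095: every entry lands in the first 4096 buckets"
    scaled max    "close to tableSize - 1: entries reach the whole table"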