Stéphane et al.,

Martin and I discussed this last night at the PDX Smalltalk user group meeting. We mostly agree, with perhaps one small exception. To make hashed collections simpler and identity hash handling more straightforward, objects should answer identity hash values spanning a small integer range, something between 0 and SmallInteger maxVal, for example. The issue is how to get that.
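(To see the limitation directly: in a stock Squeak/Pharo image the primitive identity hash occupies 12 object header bits, so a snippet like the following can never answer more than 4096:)

	"Count the distinct identity hash values over many fresh objects.
	With a 12-bit identity hash there can be at most 4096 of them, so
	any identity-hashed collection much larger than that is guaranteed
	long collision chains."
	| hashes |
	hashes := Set new.
	10000 timesRepeat: [hashes add: Object new identityHash].
	hashes size		"at most 4096"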
Martin's approach is to change identityHash itself, and to provide primIdentityHash for access to the existing implementation (primitive 75). My approach is to leave identityHash as is, and to provide a new message called scaledIdentityHash. Note that in both implementations the new identity hash values are identical. In other words, whereas Martin's changes implement

	identityHash
		^self primIdentityHash bitShift: 18

my changes implement

	scaledIdentityHash
		^self identityHash bitShift: 18

So, really, this is an issue of how to implement the *same* behavior, with the *same* goals. There are advantages and disadvantages to both; I'll try to explain. Martin, please feel free to correct what follows if I miss something.

There are two kinds of senders of identityHash. Some (most) do not care whether the answer is well distributed or not. Typically, these are implementors of #hash, for example:

	SomeObject>>hash
		^self someInstanceVariable identityHash

If, on one hand, we change identityHash, then the behavior of these hash methods improves because the values they answer are distributed more uniformly. Nevertheless, precisely because they send identityHash, it could be argued that these hash methods were written assuming there would be just a few such objects in hashed collections. If that turns out not to be the case, performance could be much better with a properly designed hash method.

Some (a few) other senders of identityHash assume the answer has a particular distribution (in Squeak's/Pharo's case, 0 through 4095). Consequently, these senders may do things like this:

	SomeObject>>hash
		^(self variableA identityHash bitShift: 12)
			+ self variableB identityHash

There may also be hashed collections that use identityHash to index hash buckets directly (e.g., GemStone's own 32-bit object cache for VisualWorks uses identityHash to access hash buckets). So, if we change identityHash, these senders, which required special care to write, would break, and would need to be rewritten like this:

	SomeObject>>hash
		^(self variableA primIdentityHash bitShift: 12)
			+ self variableB primIdentityHash

If, on the other hand, we provide a new scaledIdentityHash method, then all senders of identityHash keep behaving exactly as they do now. Hashed collections, which currently send identityHash, would need to change so that they send scaledIdentityHash instead. Furthermore, any implementor of hash such as

	SomeObject>>hash
		^self someInstanceVariable identityHash

would need to be rewritten as

	SomeObject>>hash
		^self someInstanceVariable scaledIdentityHash

if hashed collections are expected to contain numerous instances of SomeObject. If these are not rewritten, they simply keep the performance characteristics they have today (which may be adequate to begin with). Clever hacks such as

	SomeObject>>hash
		^(self variableA identityHash bitShift: 12)
			+ self variableB identityHash

would also remain undisturbed.

Finally, I do not know of any Smalltalk in which identityHash does not answer the actual object header bits for the identity hash. If we change identityHash, then AFAIK Pharo would become the only Smalltalk in which identityHash does not answer consecutive values between 0 and (2^k)-1 (k=12 for Squeak/Pharo, k=14 for 32-bit VisualWorks, k=20 for 64-bit VisualWorks, and IIRC k=15 for VA and VisualSmalltalk). Moreover, AFAIK, identityHash has had this behavior for decades. Is it a good idea to change it? I am not so sure on that one.
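For concreteness, here is a minimal sketch of the collection-side change that either proposal implies. The method shape mirrors Squeak's scanFor: probing convention and the array instance variable of HashedCollection, but the body is illustrative only, not Martin's or my actual patch:

	IdentitySet>>scanFor: anObject
		"Illustrative sketch only: probe for anObject or an empty
		slot, indexing by the scaled hash instead of the raw 12-bit
		value. Since the scaled values are multiples of 2 raisedTo:
		18, the table size should be prime (as in Martin's changes)
		so the remainders spread across all buckets."
		| finish start index element |
		finish := array size.
		start := (anObject scaledIdentityHash \\ finish) + 1.
		index := start.
		[(element := array at: index) == nil
			or: [element == anObject]]
				whileFalse: [
					index := index \\ finish + 1.
					index = start ifTrue: [^0]].
		^index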
Also, I wouldn't want to break clever hacks that assume identityHash behaves the way it currently does, and I wouldn't want to make Pharo different gratuitously.

So that's where we stand. I think I understand where Martin is coming from, and I think Martin understands where I am coming from. One way or the other, we also agree that something should be done to make hashed collections faster in Pharo.

Andres.


Stéphane Ducasse wrote:
> Martin
>
> So what you are saying is that we can gain speed, and that these
> changes are beneficial for the simple default.
>
>> The test code, a variant of the test from the HashTable SqueakSource
>> wiki, is at the bottom of this message. Basically, it adds a bunch
>> of instances of Object to a Dictionary and sees how much time it
>> takes to look them up again.
>>
>> From the graphs in the previous message, you can see that
>> performance for sizes > 4000 is greatly improved. For size = 10000,
>> #at: is 1000 times faster, 2-3 microseconds vs. >2 milliseconds. At
>> size = 10000, #at:put: is about 200 times faster, ~10 microseconds
>> vs. >2 milliseconds, and the large spikes for growing the collection
>> are 21 milliseconds vs. >4 seconds, again a factor of about 200.
>>
>> Performance for dictionary sizes < 4000 is essentially the same as
>> before, so these collections can serve as general-purpose
>> collections over a wide range of sizes. I've attached the graphs for
>> sizes < 4000 to this message so you can see that more clearly than
>> on the previous graphs.
>>
>> These results should hold for any object that inherits #hash from
>> Object, in other words any object that uses its identity hash as
>> its equality hash. Other objects with better hashes did not have as
>> serious a problem, but will probably show increased performance as
>> well due to using prime table sizes.
>>
>> These changes are in Set, so they should improve Set's subclasses as
>> well. IdentityDictionary, IdentitySet, and WeakIdentityKeyDictionary
>> did not have as serious a problem, but should see some improvement.
>> MethodDictionaries have been left alone on the assumption that the
>> VM depends on their hashing.
>
> So where are these changes? :)
> Should we integrate them?
> Andres and other hash freaks, what are your points of view?
>
>> Since there are still only 4K possible values of identity hash,
>> collisions are inevitable in large collections, and the number of
>> collisions will grow linearly with collection size. So how well do
>> the spread hash and prime table sizes do at even larger sizes? I
>> ran the same test at sizes of one million. As expected, access was
>> quite a bit slower than it had been at 10000. Time for #at: was
>> ~250 microseconds, and #at:put: was about the same. Note, however,
>> that this is still ten times faster than Dictionary previously was
>> at a size of 10000; 100 times larger, yet 10 times faster.
>>
>> I had a lot of fun doing this. These are better results than I
>> expected for a fairly minor (though deep) code change.
>
> So where are these changes? :)
> Should we integrate them?
> Andres and other hash freaks, what are your points of view?
>
>> | test ord |
>> Transcript cr.
>> test := Dictionary new.
>> [ test size >= 10000 ] whileFalse: [
>>     ord := OrderedCollection new: 100.
>>     Transcript show:
>>         [ 100 timesRepeat: [
>>             test at: (ord add: Object new) put: nil ] ] timeToRun asString.
>>     Transcript tab;
>>         show: test size asString;
>>         tab.
>>     Transcript show:
>>         [ 1000 timesRepeat: [
>>             ord do: [ :each | test at: each ] ] ] timeToRun asString.
>>     Transcript tab; show:
>>         [ 1000 timesRepeat: [
>>             ord do: [ :each | ] ] ] timeToRun asString.
>>     Transcript cr ]
>>
>> <SmallDictPerformanceGraphs1.png>
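P.S. One way to sanity-check Martin's million-element numbers (this back-of-the-envelope is mine, not his): with only 4096 possible identity hash values, each value is shared by many objects, so even a perfectly spread table must scan a collision chain of the corresponding length on every lookup:

	"Expected collision chain lengths with a 12-bit identity hash:"
	1000000 // 4096		"244 objects per hash value at one million elements"
	10000 // 4096		"about 2-3 per hash value at ten thousand"

The roughly 100x ratio between those chain lengths matches the ~250 microsecond vs. 2-3 microsecond #at: times he reports.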
