Thanks for this cool mail!
On Oct 29, 2009, at 12:51 AM, Andres Valloud wrote:

> Stephane et al,
>
> Martin and I discussed this yesterday night at the PDX Smalltalk user
> group meeting. Mostly we agree, with perhaps one small exception. In
> order to make hashed collections simpler and identity hash handling
> more straightforward, hashed collections should get a small integer
> range identity hash value. Something between 0 and SmallInteger
> maxVal, for example. The issue is how to get that.
>
> Martin's approach is to change identityHash itself, and provide
> primIdentityHash to access the existing identityHash implementation
> (primitive 75). My approach is to leave identityHash as is, and
> provide a new message called scaledIdentityHash. Note that in both
> Martin's and my implementation, the new identity hash values are
> identical. In other words, whereas Martin's changes implement
>
>     identityHash
>         ^self primIdentityHash bitShift: 18
>
> my changes implement
>
>     scaledIdentityHash
>         ^self identityHash bitShift: 18
>
> So, really, this is an issue of how to implement the *same* behavior,
> with the *same* goals. Now, there are advantages and disadvantages to
> both. I'll try to explain. Martin, please feel free to correct what
> follows if I miss something.
>
> There are two kinds of senders of identityHash. Some (most) do not
> care whether the answer is well distributed or not. Typically, these
> are implementors of #hash, for example:
>
>     SomeObject>>hash
>         ^self someInstanceVariable identityHash
>
> If, on one hand, we change identityHash, then the behavior of these
> hash methods improves, because the values they answer would be
> distributed more uniformly. Nevertheless, precisely because they send
> identityHash, it could be argued that these hash methods were
> implemented assuming there would be just a few of these objects in
> hashed collections.
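[Inline note from me, not from Andres's mail: a small worked illustration of what the shift accomplishes. A 12-bit identity hash (0..4095) shifted left by 18 bits becomes a multiple of 2^18, spanning the 0..SmallInteger maxVal range of a 32-bit image instead of crowding the bottom 4096 values. A workspace sketch, assuming a 32-bit Squeak/Pharo image:]

    "Sketch (mine, not from the thread): range of the scaled identity hash.
     Assumes a 32-bit image where SmallInteger maxVal = 2^30 - 1."
    | raw scaled |
    raw := 4095.                    "largest 12-bit identity hash"
    scaled := raw bitShift: 18.     "= 4095 * 262144 = 1073479680"
    scaled <= SmallInteger maxVal   "true: still a SmallInteger (maxVal = 1073741823)"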
> If this turns out not to be the case, then performance could be much
> better by implementing a better hash method.
>
> Some (a few) other senders of identityHash assume that the answer has
> a particular distribution (in Squeak's/Pharo's case, 0 through 4095).
> Consequently, these senders may do things like this:
>
>     SomeObject>>hash
>         ^(self variableA identityHash bitShift: 12)
>             + self variableB identityHash
>
> Also, there may be hashed collections which use identityHash to index
> hash buckets (e.g., GemStone's own 32-bit object cache for
> VisualWorks, which uses identityHash to access hash buckets). So, if
> we change identityHash, these senders, which required special care to
> write, would break, requiring them to be rewritten like this:
>
>     SomeObject>>hash
>         ^(self variableA primIdentityHash bitShift: 12)
>             + self variableB primIdentityHash
>
> If, on the other hand, we provide a new scaledIdentityHash method,
> then all senders of identityHash keep behaving the same way as they
> do now. Hashed collections, which currently send identityHash, would
> need to change so that they send scaledIdentityHash instead.
> Furthermore, any implementor of hash such as
>
>     SomeObject>>hash
>         ^self someInstanceVariable identityHash
>
> would need to be rewritten as
>
>     SomeObject>>hash
>         ^self someInstanceVariable scaledIdentityHash
>
> if hashed collections are expected to contain numerous instances of
> SomeObject. If these are not rewritten, then they would continue to
> have the same performance characteristics they have today (which may
> be adequate to begin with). Clever hacks such as
>
>     SomeObject>>hash
>         ^(self variableA identityHash bitShift: 12)
>             + self variableB identityHash
>
> would also remain undisturbed. Finally, I do not know of any
> Smalltalk in which identityHash does not answer the actual object
> header bits for the identity hash.
> If we change identityHash, then AFAIK Pharo would become the only
> Smalltalk in which identityHash does not answer consecutive values
> between 0 and (2^k)-1 (k=12 for Squeak/Pharo, k=14 for VisualWorks
> 32 bits, k=20 for VisualWorks 64 bits, IIRC k=15 for VA and
> VisualSmalltalk). Moreover, AFAIK, identityHash has had this behavior
> for decades. Is it a good idea to change it? I am not so sure on that
> one. Also, I wouldn't want to break clever hacks that assume
> identityHash behaves the way it currently does, and I wouldn't want
> to make Pharo different gratuitously.
>
> So that's where we stand. I think I understand where Martin is coming
> from, and I think Martin understands where I am coming from. One way
> or the other, we also agree that something should be done to make
> hashed collections faster in Pharo.
>
> Andres.
>
>
> Stéphane Ducasse wrote:
>> Martin,
>>
>> so what you are saying is that we can gain speed, and that these
>> changes are beneficial as a simple default.
>>
>>> The test code, a variant of the test from the HashTable SqueakSource
>>> wiki, is at the bottom of this message. Basically, it adds a bunch
>>> of instances of Object to a Dictionary and sees how much time it
>>> takes to look them up again.
>>>
>>> From the graphs in the previous message, you can see that
>>> performance for sizes > 4000 is greatly improved. For size = 10000,
>>> #at: is 1000 times faster, 2-3 microseconds vs. >2 milliseconds. At
>>> size = 10000, #at:put: is about 200 times faster, ~10 microseconds
>>> vs. >2 milliseconds, and the large spikes for growing the collection
>>> are 21 milliseconds vs. >4 seconds, again a factor of about 200.
>>>
>>> Performance for dictionary sizes < 4000 is essentially the same as
>>> before, so these collections can serve as general-purpose
>>> collections over a wide range of sizes.
>>> I've attached the graphs for sizes < 4000 to this message so you
>>> can see that more clearly than on the previous graphs.
>>>
>>> These results should hold for any object that inherits #hash from
>>> Object, in other words one that uses its identity hash as its
>>> equality hash. Other objects with better hashes did not have as
>>> serious a problem, but will probably show increased performance as
>>> well due to using prime table sizes.
>>>
>>> These changes are in Set, so they should improve Set's subclasses
>>> as well. IdentityDictionary, IdentitySet, and
>>> WeakIdentityKeyDictionary did not have as serious a problem, but
>>> should see some improvement. MethodDictionaries have been left
>>> alone on the assumption that the VM depends on the hashing of those.
>>
>> So where are these changes? :)
>> Should we integrate them?
>> Andres and other hash freaks, what are your points of view?
>>
>>> Since there are still only 4K possible values of identity hash,
>>> collisions are inevitable in large collections, and the number of
>>> collisions will grow linearly with collection size. So how well
>>> does the spread hash / prime table size do at even larger sizes? I
>>> ran the same test at sizes of one million. As expected, access was
>>> quite a bit slower than it had been at 10000. Time for #at: was
>>> ~250 microseconds, and #at:put: was about the same. Note, however,
>>> that this is still ten times faster than Dictionary previously was
>>> at a size of 10000; 100 times larger, yet 10 times faster.
>>>
>>> I had a lot of fun doing this. These are better results than I
>>> expected for a fairly minor (though deep) code change.
>>
>> So where are these changes? :)
>> Should we integrate them?
>> Andres and other hash freaks, what are your points of view?
>>
>>> | test ord |
>>> Transcript cr.
>>> test := Dictionary new.
>>> [ test size >= 10000 ] whileFalse: [
>>>     ord := OrderedCollection new: 100.
>>>     Transcript show:
>>>         [ 100 timesRepeat: [
>>>             test at: (ord add: Object new) put: nil ] ]
>>>                 timeToRun asString.
>>>     Transcript tab;
>>>         show: test size asString;
>>>         tab.
>>>     Transcript show:
>>>         [ 1000 timesRepeat: [
>>>             ord do: [ :each | test at: each ] ] ]
>>>                 timeToRun asString.
>>>     Transcript tab; show:
>>>         [ 1000 timesRepeat: [
>>>             ord do: [ :each | ] ] ]
>>>                 timeToRun asString.
>>>     Transcript cr ]
>>>
>>> <SmallDictPerformanceGraphs1.png>
>>>
>>> _______________________________________________
>>> Pharo-project mailing list
>>> Pharo-project@lists.gforge.inria.fr
>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
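P.S. A sketch of mine (not from the thread) of why the spread hash matters together with the larger, prime table sizes Martin mentions: raw identity hashes index only the first 4096 buckets of a big table, while the scaled hashes reach all of it. The table size 100003 below is an arbitrary prime chosen for illustration:

    "Hypothetical sketch: bucket occupancy in a large prime-sized table."
    | tableSize raw scaled |
    tableSize := 100003.
    raw := ((0 to: 4095) collect: [ :h | h \\ tableSize ]) asSet.
    scaled := ((0 to: 4095) collect: [ :h | (h bitShift: 18) \\ tableSize ]) asSet.
    raw max.      "4095: every entry lands in the first 4096 buckets"
    scaled max    "close to tableSize - 1: entries reach the whole table"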