Andres Valloud wrote:
> Stephane et al,
>
> Martin and I discussed this yesterday night at the PDX Smalltalk user
> group meeting. Mostly we agree, with perhaps one small exception. In
> order to make hashed collections simpler and identity hash handling
> more straightforward, hashed collections should get a small integer
> range identity hash value. Something between 0 and SmallInteger
> maxVal, for example. The issue is how to get that.
Hi Andres,

Oh, good; you started the discussion while I was being interrupted in
the middle of writing my previous message -- thanks. :-)

[...good description of the choices omitted...]

> There are two kinds of senders of identityHash. [...]

I took a survey of the senders of #identityHash in the latest web
image. There aren't that many. The largest category is those that want
the printString of the identityHash. Of those that care about the value
of the identityHash, there are several that use it in #hash methods.
The most common is this definition:

    hash
        ^self identityHash

These are presumably overriding superclass behavior to restore Object
behavior. If Object>>hash were to be defined to answer 'self
scaledIdentityHash', then these methods would also need to be changed
in order to get the performance improvement for collections containing
those objects. If, OTOH, #identityHash itself changes, these classes
get the benefit with no change.

> Some (most) do not care whether the answer is well distributed or
> not. Typically, these are implementors of #hash, for example:
>
> SomeObject>>hash
>
>     ^self someInstanceVariable identityHash

And indeed there are also some that do this.

> If, on one hand, we change identityHash, then the behavior of these
> hash methods improves because the values they answer would be
> distributed more uniformly. Nevertheless, precisely because they send
> identityHash, it could be argued that these hash methods were
> implemented assuming there would be just a few of these objects in
> hashed collections.

If the authors knew about the limited range of #identityHash, that is
entirely possible. I tend to think it more likely that in most cases
these implementations are just the simplest way to follow the dictate
that 'a = b -> a hash = b hash', and that they didn't really think
about the impact on collection performance.
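As a sanity check on that performance impact, here's a rough simulation
in Python rather than Smalltalk (the helper name and numbers are my
own, purely illustrative): with identityHash confined to 12 bits, chain
lengths in an identity-based collection grow linearly once it holds
many more than 4096 objects.

```python
import random

# Squeak/Pharo identityHash is limited to 12 bits: values 0..4095.
RANGE_12_BIT = 4096

def average_chain_length(n_objects, hash_range):
    """Average number of objects sharing each occupied hash value."""
    random.seed(42)  # deterministic, just for illustration
    buckets = {}
    for _ in range(n_objects):
        h = random.randrange(hash_range)
        buckets[h] = buckets.get(h, 0) + 1
    return n_objects / len(buckets)

# 100,000 objects crammed into 4096 possible hash values:
# roughly 24 objects share each value, so lookups degrade to
# long linear scans.
print(average_chain_length(100_000, RANGE_12_BIT))

# The same objects over a SmallInteger-sized range: collisions
# are rare and chains stay close to length 1.
print(average_chain_length(100_000, 2 ** 30))
```

The exact numbers don't matter; the point is that the average chain
length is about n/4096 no matter how clever the collection is, which is
why the range, not the hash function, is the bottleneck.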
> Some (a few) other senders of identityHash assume that the answer has
> a particular distribution (in Squeak's/Pharo's case, 0 through 4095).

Here are the senders I found that care about the actual values
returned, along with whether they'd be harmed or improved by increasing
the range returned:

    FixedIdentitySet           -- harmed
    IdentityDictionary         -- somewhat improved
    IdentitySet                -- somewhat improved
    KeyedIdentitySet           -- greatly improved
    MethodDictionary           -- harmed
    RefactoryTyper             -- probably improved
    WeakIdentityKeyDictionary  -- somewhat improved

That's 5 improved, 2 harmed. And one of the two listed as harmed is
MethodDictionary, whose performance would not be harmed, but I assume
the VM would not be happy if its hashing was changed (anybody know for
sure whether that's true?).

> Consequently, the senders may do things like this:
>
> SomeObject>>hash
>
>     ^(self variableA identityHash bitShift: 12) + self variableB identityHash

They could, and I admit to having written this kind of code in the
past, but I doubt that I'm typical in doing so. Do you know of any
Pharo code that actually *does* this sort of thing? There isn't any in
the distributed web image, but I didn't look at every package that is
meant to be loadable in Pharo.

> If, on the other hand, we provide a new scaledIdentityHash method,
> then all senders of identityHash keep behaving the same way as they
> do now. Hashed collections, which currently send identityHash, would
> need to change so that they send scaledIdentityHash instead.
> Furthermore, any implementor of hash such as
>
> SomeObject>>hash
>
>     ^self someInstanceVariable identityHash
>
> would need to be rewritten as
>
> SomeObject>>hash
>
>     ^self someInstanceVariable scaledIdentityHash
>
> if hashed collections are expected to contain numerous instances of
> SomeObject.

Yes. I believe that this is a disadvantage of the #scaledIdentityHash
approach.
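Incidentally, the arithmetic behind that "clever hack" is easy to model
outside Smalltalk. Here's a toy Python equivalent (my own sketch,
assuming Squeak's/Pharo's 0..4095 identityHash range) showing both why
the trick works today and why it breaks if identityHash is widened:

```python
def combined_hash(hash_a, hash_b):
    """Model of the Smalltalk expression:
       ^(self variableA identityHash bitShift: 12) + self variableB identityHash
    (bitShift: with a positive argument is a left shift)."""
    return (hash_a << 12) + hash_b

# While both hashes stay in 0..4095, every pair gets a distinct
# 24-bit combined value:
assert combined_hash(1, 0) == 4096
assert combined_hash(4095, 4095) == (1 << 24) - 1

# But if identityHash were widened beyond 12 bits, distinct pairs
# could collide, which is why this kind of code would have to change:
assert combined_hash(5000, 0) == combined_hash(4998, 8192)
```

So the hack bakes the 12-bit assumption into user code, which is
exactly the compatibility question under discussion.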
> If these are not rewritten, then they would continue to have the same
> performance characteristics they have today (which may be adequate to
> begin with).

True, if the collections are expected to remain small, their
performance is already adequate. However, small collections will still
have good performance if #identityHash is changed, and collections that
the authors thought would remain small when they were written back in
the days of sub-megabyte memory may now get bigger than their authors
intended.

> Clever hacks such as
>
> SomeObject>>hash
>
>     ^(self variableA identityHash bitShift: 12) + self variableB identityHash
>
> would also remain undisturbed.

Yes, if #identityHash is changed it's the clever hacks that will have
to change. This could be a disadvantage of this approach, but often, as
in the case of IdentityDictionary, IdentitySet, and
WeakIdentityKeyDictionary, the necessary change is simply to remove the
clever hack, get simpler code, and enjoy better performance than you
got with the clever hack, so making the change is IMO an improvement.

> Finally, I do not know of any Smalltalk in which identityHash does
> not answer the actual object header bits for the identity hash. If we
> change identityHash, then AFAIK Pharo would become the only Smalltalk
> in which identityHash does not answer consecutive values between 0
> and (2^k)-1 (k=12 for Squeak/Pharo, k=14 for VisualWorks 32 bits,
> k=20 for VisualWorks 64 bits, IIRC k=15 for VA and VisualSmalltalk).

GemStone is a Smalltalk that does not answer consecutive values for
identityHash. In GemStone the identityHash is computed from the
object's OOP, and OOPs are not consecutive. Every object's identityHash
is different, and the range is not limited (at least in 32-bit
GemStone; I'd have to check what we do in 64-bit). And Smalltalk-80
basically used the same scheme: though you could only have 32K objects,
every one had a different identityHash based on its OOP. Also, most
(all?)
Smalltalks with limited ranges for identityHash do have a larger range
of identityHash for SmallIntegers (usually ^self), so you can't use the
clever hacks if you might have any SmallIntegers in your collection. So
any general-purpose collection must already deal with the full
SmallInteger range of identity hashes as keys, cannot use the clever
hacks, and so is likely to only be improved by changing #identityHash.
This is a key point that I forgot to bring up last night.

> So that's where we stand. I think I understand where Martin is coming
> from, and I think Martin understands where I am coming from. One way
> or the other, we also agree that something should be done to make
> hashed collections faster in Pharo.

Yes, Andres and I are in far more agreement than disagreement about the
hashing changes.

Regards,

-Martin

_______________________________________________
Pharo-project mailing list
[email protected]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
