Andres Valloud wrote:
> Stephane et al,
>
> Martin and I discussed this yesterday night at the PDX Smalltalk user
> group meeting. Mostly we agree, with perhaps one small exception. In
> order to make hashed collections simpler and identity hash handling
> more straightforward, hashed collections should get a small integer
> range identity hash value. Something between 0 and SmallInteger
> maxVal, for example. The issue is how to get that.
Hi Andres,

Oh, good; you started the discussion while I was being interrupted in
the middle of writing my previous message -- thanks. :-)

[...good description of the choices omitted...]

> There are two kinds of senders of identityHash. [...]

I took a survey of the senders of #identityHash in the latest web
image. There aren't that many. The largest category is those that want
the printString of the identityHash. Of those that care about the value
of the identityHash, there are several that use it in #hash methods.
The most common is this definition:

    hash
        ^self identityHash

These are presumably overriding superclass behavior to restore Object
behavior. If Object>>hash were to be defined to answer 'self
scaledIdentityHash', then these methods would also need to be changed
in order to get the performance improvement for collections containing
those objects. If, OTOH, #identityHash itself changes, these classes
get the benefit with no change.

> Some (most) do not care whether the answer is well distributed or
> not. Typically, these are implementors of #hash, for example:
>
> SomeObject>>hash
>
>     ^self someInstanceVariable identityHash

And indeed there are also some that do this.

> If, on one hand, we change identityHash, then the behavior of these
> hash methods improves because the values they answer would be
> distributed more uniformly. Nevertheless, precisely because they send
> identityHash, it could be argued that these hash methods were
> implemented assuming there would be just a few of these objects in
> hashed collections.

If the authors knew about the limited range of #identityHash, that is
entirely possible. I tend to think it more likely that in most cases
these implementations are just the simplest way to follow the dictate
that 'a = b -> a hash = b hash', and that they didn't really think
about the impact on collection performance.
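As a sanity check on that performance impact, here's a rough simulation
in Python rather than Smalltalk (the helper name and numbers are my
own, purely illustrative): with identityHash confined to 12 bits, chain
lengths in an identity-based collection grow linearly once it holds
many more than 4096 objects.

```python
import random

# Squeak/Pharo identityHash is limited to 12 bits: values 0..4095.
RANGE_12_BIT = 4096

def average_chain_length(n_objects, hash_range):
    """Average number of objects sharing each occupied hash value."""
    random.seed(42)  # deterministic, just for illustration
    buckets = {}
    for _ in range(n_objects):
        h = random.randrange(hash_range)
        buckets[h] = buckets.get(h, 0) + 1
    return n_objects / len(buckets)

# 100,000 objects crammed into 4096 possible hash values:
# roughly 24 objects share each value, so lookups degrade to
# long linear scans.
print(average_chain_length(100_000, RANGE_12_BIT))

# The same objects over a SmallInteger-sized range: collisions
# are rare and chains stay close to length 1.
print(average_chain_length(100_000, 2 ** 30))
```

The exact numbers don't matter; the point is that the average chain
length is about n/4096 no matter how clever the collection is, which is
why the range, not the hash function, is the bottleneck.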
> Some (a few) other senders of identityHash assume that the answer has
> a particular distribution (in Squeak's/Pharo's case, 0 through 4095).

Here are the senders I found that care about the actual values
returned, along with whether they'd be harmed or improved by increasing
the range returned:

    FixedIdentitySet           -- harmed
    IdentityDictionary         -- somewhat improved
    IdentitySet                -- somewhat improved
    KeyedIdentitySet           -- greatly improved
    MethodDictionary           -- harmed
    RefactoryTyper             -- probably improved
    WeakIdentityKeyDictionary  -- somewhat improved

That's 5 improved, 2 harmed. And one of the two listed as harmed is
MethodDictionary, whose performance would not be harmed, but I assume
the VM would not be happy if its hashing was changed (anybody know for
sure whether that's true?).

> Consequently, the senders may do things like this:
>
> SomeObject>>hash
>
>     ^(self variableA identityHash bitShift: 12) + self variableB identityHash

They could, and I admit to having written this kind of code in the
past, but I doubt that I'm typical in doing so. Do you know of any
Pharo code that actually *does* this sort of thing? There isn't any in
the distributed web image, but I didn't look at every package that is
meant to be loadable in Pharo.

> If, on the other hand, we provide a new scaledIdentityHash method,
> then all senders of identityHash keep behaving the same way as they
> do now. Hashed collections, which currently send identityHash, would
> need to change so that they send scaledIdentityHash instead.
> Furthermore, any implementor of hash such as
>
> SomeObject>>hash
>
>     ^self someInstanceVariable identityHash
>
> would need to be rewritten as
>
> SomeObject>>hash
>
>     ^self someInstanceVariable scaledIdentityHash
>
> if hashed collections are expected to contain numerous instances of
> SomeObject.

Yes. I believe that this is a disadvantage of the #scaledIdentityHash
approach.
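Incidentally, the arithmetic behind that "clever hack" is easy to model
outside Smalltalk. Here's a toy Python equivalent (my own sketch,
assuming Squeak's/Pharo's 0..4095 identityHash range) showing both why
the trick works today and why it breaks if identityHash is widened:

```python
def combined_hash(hash_a, hash_b):
    """Model of the Smalltalk expression:
       ^(self variableA identityHash bitShift: 12) + self variableB identityHash
    (bitShift: with a positive argument is a left shift)."""
    return (hash_a << 12) + hash_b

# While both hashes stay in 0..4095, every pair gets a distinct
# 24-bit combined value:
assert combined_hash(1, 0) == 4096
assert combined_hash(4095, 4095) == (1 << 24) - 1

# But if identityHash were widened beyond 12 bits, distinct pairs
# could collide, which is why this kind of code would have to change:
assert combined_hash(5000, 0) == combined_hash(4998, 8192)
```

So the hack bakes the 12-bit assumption into user code, which is
exactly the compatibility question under discussion.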
> If these are not rewritten, then they would continue to have the same
> performance characteristics they have today (which may be adequate to
> begin with).

True, if the collections are expected to remain small, their
performance is already adequate. However, small collections will still
have good performance if #identityHash is changed, and collections that
the authors thought would remain small when they were written back in
the days of sub-megabyte memory may now get bigger than their authors
intended.

> Clever hacks such as
>
> SomeObject>>hash
>
>     ^(self variableA identityHash bitShift: 12) + self variableB identityHash
>
> would also remain undisturbed.

Yes, if #identityHash is changed it's the clever hacks that will have
to change. This could be a disadvantage of this approach, but often, as
in the case of IdentityDictionary, IdentitySet, and
WeakIdentityKeyDictionary, the necessary change is simply to remove the
clever hack, get simpler code, and enjoy better performance than you
got with the clever hack, so making the change is IMO an improvement.

> Finally, I do not know of any Smalltalk in which identityHash does
> not answer the actual object header bits for the identity hash. If we
> change identityHash, then AFAIK Pharo would become the only Smalltalk
> in which identityHash does not answer consecutive values between 0
> and (2^k)-1 (k=12 for Squeak/Pharo, k=14 for VisualWorks 32 bits,
> k=20 for VisualWorks 64 bits, IIRC k=15 for VA and VisualSmalltalk).

GemStone is a Smalltalk that does not answer consecutive values for
identityHash. In GemStone the identityHash is computed from the
object's OOP, and OOPs are not consecutive. Every object's identityHash
is different, and the range is not limited (at least in 32-bit
GemStone; I'd have to check what we do in 64-bit). And Smalltalk-80
basically used the same scheme: though you could only have 32K objects,
every one had a different identityHash based on its OOP. Also, most
(all?)
Smalltalks with limited ranges for identityHash do have a larger range
of identityHash for SmallIntegers (usually ^self), so you can't use the
clever hacks if you might have any SmallIntegers in your collection. So
any general-purpose collection must already deal with the full
SmallInteger range of identity hashes as keys, cannot use the clever
hacks, and so is likely to only be improved by changing #identityHash.
This is a key point that I forgot to bring up last night.

> So that's where we stand. I think I understand where Martin is coming
> from, and I think Martin understands where I am coming from. One way
> or the other, we also agree that something should be done to make
> hashed collections faster in Pharo.

Yes, Andres and I are in far more agreement than disagreement about the
hashing changes.

Regards,

-Martin

_______________________________________________
Pharo-project mailing list
[email protected]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
