On Fri, 11 Aug 2017 12:58 am, Chris Angelico wrote:
> On Fri, Aug 11, 2017 at 12:45 AM, Steve D'Aprano
> <steve+pyt...@pearwood.info> wrote:
>> The C code says:
>>> /* bottom 3 or 4 bits are likely to be 0; rotate y by 4 to avoid
>>> excessive hash collisions for dicts and sets */
>> which I think agrees with my comment: using the id() itself would put too
>> many objects in the same bucket (i.e. too many collisions).
>>> If that were the problem it wouldn't be solved by the current approach:
>>>>>> sample = [object() for _ in range(10)]
>>>>>> [hash(b) - hash(a) for a, b in zip(sample, sample[1:])]
>>> [1, 1, 1, 1, 1, 1, 1, 1, 1]
> A difference of 1 in a hash is usually going to mean dropping
> something into the next bucket. A difference of 4, 8, or 16 would mean
> that a tiny dictionary (which has 8 slots and thus uses modulo-8)
> would have everything on the same slot.
Um... yes? And how does that relate to the comment given in the source code?
"bottom 3 or 4 bits are likely to be 0; rotate y by 4 to avoid excessive hash
collisions for dicts and sets"
According to the comment, IDs of objects are typically:
0b(bunch of bits)1000
0b(bunch of bits)0000
i.e. they're typically multiples of 8 or 16. Right? So modulo 8, they'll all map
to the zeroeth bucket; modulo 16, they'll all map to the zeroeth or eighth
bucket. Collisions, just like I said. (Was I wrong?)
By stripping of the first four bits, you get:
0b(bunch of bits)
which will hopefully be well-mixed modulo 8 or 16. Are we in agreement so far?
>> So my money is on object() being anomalous: because it is so small, the
>> hashes end up so similar. For "typical" classes, the hash function does a
>> much better job of mixing the hash values up.
> And this is also possible, but most likely the difference would simply
> widen; small dictionaries would still have all objects landing in the
> zeroth bucket. Hence rotating away the low bits.
I can't tell if you're disagreeing with me or not.
I said: using the object ID itself would be a terrible hash, because there would
be lots of collisions. Lo and behold the source code for the default hash says
(paraphrased), "don't use the ID itself, that will have lots of collisions" and
processes the ID by doing a >> 4 to strip off the least significant four bits.
So I'm genuinely puzzled on where (if anywhere) our point of disagreement is.
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.