Dmitry Rubanovich added the comment:

Yes, I do understand it.  

But the statement in lines 166, 167: "For any initial j in range(2**i), 
repeating that 2**i times generates each int in range(2**i) exactly once" does 
not hold when the perturb is added.  2**i times will not be enough to generate 
all elements of the ring when some of multipliers are zero-divisors.  

Specifically, if you use j=((5*j) + P + 1) mod 2**i, you are effectively 
multiplying by a zero divisors every time P = j mod 2.  And I don't mean "mod 
2**i."  I do mean "mod 2."  Which means anytime P (which changes a few times 
and eventually becomes 0), has the same parity as j, you are multiplying by a 
zero-divisor.  Because P is eventually 0, you will go through all the values 
**eventually**.  But for all values of P for which there is a P' such that 
P=/=P' and ((5*j) + P + 1) = ((5*j) + P' + 1) mod 2**i, the number of times 
you'll need to apply the function will be higher than 2**i.  And that's in the 
worst case scenario.

In the best case scenario, the collision probability can be gathered from 
looking at the values of my script printed by 
print_collision_counts(py3_6_1_lookdict_perturb).  It can be as low as ~1/20 
and as high as ~1/3 depending on the value of i.

The main speed up we are seeking is to avoid a collision early on.  And we are 
less concerned about collisions later on.  Anecdotally, if your dict has 256 
buckets, then the chance of collision is 1 in ~3.5.  Which is an improvement 
over 1 in 2, but still pretty high.  

Ok, how about this: to avoid the edge cases, unroll the 1st secondary hash key 
to use j = 2*j + P + 1.  So try to test for it before the loop.  But leave 5*j 
+ P + 1 in the loop as is.  

Although, to be honest if PERTURB_SHIFT is changed to 1 and we have a dict with 
a key that causes 64 collisions, this may be a good time to resize even if we 
are still sparse.  On the other hand, this might create an attack vector with 
some magic value which causes resizes of near empty dict's.  So maybe not...  
Certainly not without further analysis.

BTW, and it's mostly stylistic, but except for the "*hashpos = i & mask;" line 
in "find_empty_slot()", applying of the mask can be moved to "dk_get_index()".  
Again, I am looking at the released 3.6.1 code, so if this is already done, 
then never mind.

As another note, changing PERTURB_SHIFT to 1 causes a near-universal reduction 
in collisions (for all strategies).  So if nothing else, that's probably an 
improvement.

----------
Added file: http://bugs.python.org/file46956/fewer_collisions_count.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30671>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to