STINNER Victor <victor.stin...@haypocalc.com> added the comment:

I tried the collision counting with a low number of collisions:

less than 15 collisions
-----------------------

Fail at startup.

5 collisions (32 buckets, 21 used=65.6%): hash=ceb3152f => f
10 collisions (32 buckets, 21 used=65.6%): hash=ceb3152f => f

dict((str(k), 0) for k in range(2000000))
-----------------------------------------

15 collisions (32,768 buckets, 18024 used=55.0%): hash=0e4631d2 => 31d2
20 collisions (131,072 buckets, 81568 used=62.2%): hash=12660719 => 719
25 collisions (1,048,576 buckets, 643992 used=61.4%): hash=6a1f6d21 => f6d21
30 collisions (1,048,576 buckets, 643992 used=61.4%): hash=6a1f6d21 => f6d21
35 collisions => ? (more than 10,000,000 integers)

random_dict('', 50000, charset, 1, 3)
--------------------------------------

charset = 'abcdefghijklmnopqrstuvwxyz0123456789'

15 collisions (8192 buckets, 5083 used=62.0%): hash=1526677a => 77a
20 collisions (32768 buckets, 19098 used=58.3%): hash=5d7760e6 => 60e6
25 collisions => <unable to generate a new key>

random_dict('', 50000, charset, 1, 3)
--------------------------------------

charset = 
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.=+_(){}%'

15 collisions (32768 buckets, 20572 used=62.8%): hash=789fe1e6 => 61e6
20 collisions (2048 buckets, 1297 used=63.3%): hash=2052533d => 33d
25 collisions => nope

random_dict('', 50000, charset, 1, 10)
--------------------------------------

charset = 
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.=+_(){}%'

15 collisions (32768 buckets, 18964 used=57.9%): hash=94d7c4f5 => 44f5
20 collisions (32768 buckets, 21548 used=65.8%): hash=acb5b39e => 339e
25 collisions (8192 buckets, 5395 used=65.9%): hash=04d367ae => 7ae
30 collisions => nope

random_dict() comes from the following script:
***
import random

def random_string(charset, minlen, maxlen):
    strlen = random.randint(minlen, maxlen)
    return ''.join(random.choice(charset) for index in xrange(strlen))

def random_dict(prefix, count, charset, minlen, maxlen):
    dico = {}
    keys = set()
    for index in xrange(count):
        for tries in xrange(10000):
            key = prefix + random_string(charset, minlen, maxlen)
            if key in keys:
                continue
            keys.add(key)
            break
        else:
            raise ValueError("unable to generate a new key")
        dico[key] = None

charset = 
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.=+_(){}%'
charset = 'abcdefghijklmnopqrstuvwxyz0123456789'
random_dict('', 50000, charset, 1, 3)
***

I ran the Django test suite. With a limit of 20 collisions, 60 tests
fail. With a limit of 50 collisions, there is no failure. But I don't
think that the test suite uses large data sets.

I also triend the Django test suite with a randomized hash function.
There are 46 failures. Many (all?) are related to the order of dict
keys: repr(dict) or indirectly in a HTML output. I didn't analyze all
failures. I suppose that Django can simply run the test suite using
PYTHONHASHSEED=0 (disable the randomized hash function), at least in a
first time.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue13703>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to