I apologize, the simulation code had a flaw. I'm embarrassed that I
didn't immediately recognize this immediately from intuition. We
could get even more accurate results by computing actual SHA-1 of
actual message-ids, but I'm not sure it is worth the effort. Here is a
revised program and the results from one run. For the record, I'm
interested in billion+ message counts, for which we'd need a few more
hash characters. But not that many.
message count 10000000, hash length 4, collisions 99.992930%
message count 10000000, hash length 5, collisions 25.755240%
message count 10000000, hash length 6, collisions 0.928050%
message count 10000000, hash length 7, collisions 0.029860%
message count 10000000, hash length 8, collisions 0.000800%
message count 10000000, hash length 9, collisions 0.000060%
message count 10000000, hash length 10, collisions 0.000000%
===
#!/usr/bin/python
import random
def compute(message_count, hashlength):
database = {}
for i in range(message_count):
n = random.randint(0, pow(2, 5 * hashlength))
if n in database:
database[n] += 1
else:
database[n] = 1
collisions = 0
for i in database:
if database[i] > 1:
collisions += database[i]
p1 = (100.0 * collisions) / float(message_count)
print("message count %d, hash length %d, collisions %f%%" %
(message_count, hashlength, p1))
for i in range(4, 11):
compute(10000000, i)
_______________________________________________
Mailman-Developers mailing list
[email protected]
http://mail.python.org/mailman/listinfo/mailman-developers
Mailman FAQ: http://wiki.list.org/x/AgA3
Searchable Archives:
http://www.mail-archive.com/mailman-developers%40python.org/
Unsubscribe:
http://mail.python.org/mailman/options/mailman-developers/archive%40jab.org
Security Policy: http://wiki.list.org/x/QIA9