I apologize, the simulation code had a flaw.  I'm embarrassed that I
didn't immediately recognize this immediately from intuition.  We
could get even more accurate results by computing actual SHA-1 of
actual message-ids, but I'm not sure it is worth the effort. Here is a
revised program and the results from one run. For the record, I'm
interested in billion+ message counts, for which we'd need a few more
hash characters. But not that many.

message count 10000000, hash length 4, collisions 99.992930%
message count 10000000, hash length 5, collisions 25.755240%
message count 10000000, hash length 6, collisions 0.928050%
message count 10000000, hash length 7, collisions 0.029860%
message count 10000000, hash length 8, collisions 0.000800%
message count 10000000, hash length 9, collisions 0.000060%
message count 10000000, hash length 10, collisions 0.000000%

===

#!/usr/bin/python
import random
def compute(message_count, hashlength):
  database = {}
  for i in range(message_count):
    n = random.randint(0, pow(2, 5 * hashlength))
    if n in database:
      database[n] += 1
    else:
      database[n] = 1
  collisions = 0
  for i in database:
    if database[i] > 1:
      collisions += database[i]
  p1 = (100.0 * collisions) / float(message_count)
  print("message count %d, hash length %d, collisions %f%%" %
        (message_count, hashlength, p1))

for i in range(4, 11):
  compute(10000000, i)
_______________________________________________
Mailman-Developers mailing list
Mailman-Developers@python.org
http://mail.python.org/mailman/listinfo/mailman-developers
Mailman FAQ: http://wiki.list.org/x/AgA3
Searchable Archives: 
http://www.mail-archive.com/mailman-developers%40python.org/
Unsubscribe: 
http://mail.python.org/mailman/options/mailman-developers/archive%40jab.org

Security Policy: http://wiki.list.org/x/QIA9

Reply via email to