I apologize, the simulation code had a flaw. I'm embarrassed that I didn't immediately recognize this immediately from intuition. We could get even more accurate results by computing actual SHA-1 of actual message-ids, but I'm not sure it is worth the effort. Here is a revised program and the results from one run. For the record, I'm interested in billion+ message counts, for which we'd need a few more hash characters. But not that many.
message count 10000000, hash length 4, collisions 99.992930% message count 10000000, hash length 5, collisions 25.755240% message count 10000000, hash length 6, collisions 0.928050% message count 10000000, hash length 7, collisions 0.029860% message count 10000000, hash length 8, collisions 0.000800% message count 10000000, hash length 9, collisions 0.000060% message count 10000000, hash length 10, collisions 0.000000% === #!/usr/bin/python import random def compute(message_count, hashlength): database = {} for i in range(message_count): n = random.randint(0, pow(2, 5 * hashlength)) if n in database: database[n] += 1 else: database[n] = 1 collisions = 0 for i in database: if database[i] > 1: collisions += database[i] p1 = (100.0 * collisions) / float(message_count) print("message count %d, hash length %d, collisions %f%%" % (message_count, hashlength, p1)) for i in range(4, 11): compute(10000000, i) _______________________________________________ Mailman-Developers mailing list Mailman-Developers@python.org http://mail.python.org/mailman/listinfo/mailman-developers Mailman FAQ: http://wiki.list.org/x/AgA3 Searchable Archives: http://www.mail-archive.com/mailman-developers%40python.org/ Unsubscribe: http://mail.python.org/mailman/options/mailman-developers/archive%40jab.org Security Policy: http://wiki.list.org/x/QIA9