On 14.11.2012 01:41, Richard Baron Penman wrote:
> I found the MD5 and SHA hashes slow to calculate.

Slow? For URLs? Are you kidding? How many URLs per second do you want to
calculate?

> The builtin hash is fast but I was concerned about collisions. What
> rate of collisions could I expect?

MD5 has 16 bytes (128 bit), SHA1 has 20 bytes (160 bit). Utilizing the
birthday paradox and some approximations, I can tell you that when using
the full MD5 you'd need around 2.609e16 hashes in the same namespace to
get a one in a million chance of a collision. That is, 26090000000000000
filenames.

For SHA1 This number rises even further and you'd need around 1.71e21 or
1710000000000000000000 hashes in one namespace for the one-in-a-million.

I really have no clue about how many URLs you want to hash, and it seems
to be LOTS since the speed of MD5 seems to be an issue for you. Let me
estimate that you'd want to calculate a million hashes per second then
when you use MD5, you'd have about 827 years to fill the namespace up
enough to get a one-in-a-million.

If you need even more hashes (say a million million per second), I'd
suggest you go with SHA-1, giving you 54 years to get the one-in-a-million.

Then again, if you went for a million million hashes per second, Python
would probably not be the language of your choice.

Best regards,
Johannes

-- 
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
 - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1...@speranza.aioe.org>
-- 
http://mail.python.org/mailman/listinfo/python-list

Reply via email to