James Cloos wrote:
"Jani" == Jani Partanen <[EMAIL PROTECTED]> writes:
Jani> Every time when you hash something what is bigger than your
Jani> returned hash, there can be collision.
The odds of a collision with a cryptographically strong hash are
infinitesimal.
Anybody who claims that does not understand the mathematics
behind the birthday paradox (http://en.wikipedia.org/wiki/Birthday_paradox)
How many people do you need to have in a room before there is a 50/50
chance that two will have the same birthday? Answer is 23. You want
99.9% probability?
That number is around 60!
365 days, 60 people, and you have a virtually guaranteed collision.
Think of people in the birthday paradox as files, and the birthdays as
the checksum,
and you will see that collisions (in a mail system with thousands of
stored attachments) are more likely than you thought.
Sha1 is not perfect, but even if its actual strength is closer to say 72
bits than its theoretical maximum strength (which is 80 bits for a 160
bit long hash), the odds of a collision are two small to worry about.
You still talking about one chance in on the order of 100000000000000000000.
The sha2 family does provide more bits. And whirlpool¹ seems to have
consensus as the “best” choice currently available. But sha1 is still
good enough for general use.
Sha1 and other cryptographically strong hashes are designed to make
very obvious the difference between two data sources that appear similar.
They are *NOT* designed to detect the difference between files that are
"obviously different".
This is fortunate, because mathematically it is easy to prove that, in
general, they can NOT do that.
If the length of the hash is 20 bytes, and the maximum length of the
attachment is 20MB, then
the number of different attachments that map to the same value (even if
you use an IDEAL hash function) is in the millions. If you double the
size of hash (or use two different hashes), you are only halving the
number of different attachments that will map to the same hash value.
And because of the birthday paradox, that doesn't lower the probability
of collision as much as you'd at first think.
Unfortunately, if you want to be assured of a hash function with no
collisions, your hash has to be the same size as the maximum allowed
attachment - which is impractical.
Please, read the wikipedia entry on the birthday paradox, as well as
the entry on hash tables, http://en.wikipedia.org/wiki/Hash_table
The collisions will be rare, and I don't think that the byte-by-byte
comparison when you get a hit will add significantly to the load on the
server - but if you don't implement it, you will get bitten by a
collision sooner or later.
_______________________________________________
DBmail mailing list
[email protected]
https://mailman.fastxs.nl/mailman/listinfo/dbmail