On Sat, 17 Jan 2004, Volker Kuhlmann wrote:
...
> It's a hash, and you can see easily that any number of bytes of input
> are transformed into 32 bytes of output. From this one can conclude
> that there have to be different files (of possibly vastly different
> size?) which transform into the same hash value. Think "number" space.

Yes, that is what I was referring to. But for files of size greater
than the length of the md5sum, there have to be different files of the
same size generating the same md5sum. Take, e.g., files of size 33
bytes. Then you will have at least 256 times as many different files
as you have different md5sums (in fact far more: an md5sum is 128
bits, i.e. 16 bytes, printed as 32 hex digits). So there must be at
least one pair of different files of size 33 with identical md5sum.
And so on... the longer the files, the more of them have to have
identical md5sums. But
practically this should not be much of a problem, because the files we
use are normally useful and thus have a certain format, so that they
have common parts which do not differ. Hence, with increasing file
size, we are practically exploiting ever smaller fractions of
the available file space. Example for this hypothesis: how many
different one-byte files would be practically used? I guess 256. How
many different two-byte files? 65536. How many different files of size
1 GB? Well, theoretically there could be 256^(10^9), but this number
is possibly greater than the number of atoms on Earth - could someone
check, please? But there are files of this size. So we can only use a
really small fraction of the whole space (at a time).
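
To make the counting argument concrete, here is a small Python sketch.
It is only an illustration under stated assumptions: the digest is cut
down to a single byte (a real md5sum has 16), so that a collision among
two-byte files can be found by brute force. The helper names
truncated_md5 and find_collision are made up for this example. The last
lines estimate how many decimal digits 256^(10^9) has, which can be
compared with the commonly quoted rough estimate of about 10^50 atoms
on Earth, a number with only 51 digits.

```python
import hashlib
import math
from itertools import product


def truncated_md5(data: bytes, nbytes: int = 1) -> bytes:
    """Return the first nbytes of the MD5 digest (truncated for illustration only)."""
    return hashlib.md5(data).digest()[:nbytes]


def find_collision(file_size: int = 2, nbytes: int = 1):
    """Enumerate files of file_size bytes until two share a truncated digest."""
    seen = {}
    for combo in product(range(256), repeat=file_size):
        data = bytes(combo)
        key = truncated_md5(data, nbytes)
        if key in seen:
            return seen[key], data  # two different files, same truncated digest
        seen[key] = data
    return None  # only possible if there are fewer files than digests


a, b = find_collision()
print(a.hex(), b.hex(), truncated_md5(a).hex())

# Number of decimal digits of 256**(10**9), computed via logarithms
# instead of building the (enormous) integer itself.
digits = math.floor(10**9 * math.log10(256)) + 1
print(digits)
```

With 65536 possible two-byte files and only 256 possible one-byte
digests, the search has to succeed after at most 257 files. The digit
count comes out at roughly 2.4 billion, so 256^(10^9) is vastly greater
than any count of atoms.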

As long as the total number of files we are looking at is well below
the square root of the number of different md5sums (about 2^64, by
the birthday paradox), you would really have to be (un-)lucky to see
two different files with the same md5sum. But it _can_ happen!
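
How (un-)lucky exactly? A rough estimate can be sketched with the usual
birthday approximation p ~ 1 - exp(-n(n-1)/(2N)), where n is the number
of files and N = 2^128 the number of possible md5sums. This assumes
that MD5 spreads accidental (not deliberately crafted) inputs uniformly
over all digests, which is an idealisation; the function name
collision_probability is made up for this example.

```python
import math


def collision_probability(n_files: int, digest_bits: int = 128) -> float:
    """Approximate chance that at least two of n_files share a digest.

    Uses the birthday approximation and assumes uniformly distributed digests.
    """
    n_digests = 2 ** digest_bits
    exponent = n_files * (n_files - 1) / (2 * n_digests)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-exponent)


for n in (10**6, 10**9, 10**12, 2**64):
    print(n, collision_probability(n))
```

Under these assumptions the probability stays negligible for any
realistic number of files and only approaches even odds at around 2^64
files.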

> The property that only a small change of input results in a large
> change of output makes it interesting for checksumming, because it
> reduces the probability that 2 similar bit errors in the data cancel
> each other out, i.e. produce the same hash. That's what I understand
> anyway.

Yes, this is often called the "avalanche effect"; the underlying
design principles are called "confusion" and "diffusion"...
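
The effect is easy to observe. The following Python sketch flips a
single bit of an arbitrary example message and counts how many of the
128 bits of the md5sum change. The message and the helper name
bit_difference are just assumptions for this illustration.

```python
import hashlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


original = b"The quick brown fox jumps over the lazy dog"
flipped = bytearray(original)
flipped[0] ^= 0x01  # flip the lowest bit of the first byte

d1 = hashlib.md5(original).digest()
d2 = hashlib.md5(bytes(flipped)).digest()

changed = bit_difference(d1, d2)
print(changed, "of", len(d1) * 8, "digest bits changed")
```

For a hash that mixes its input well, one would expect roughly half of
the digest bits to change, even though the inputs differ in one bit.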

Kind regards,

Helmut.

+----------------+
| Helmut Walle   |
| [EMAIL PROTECTED] |
| 03 - 388 39 54 |
+----------------+
