> > From the MD5 RFC:
> 
>   http://www.csl.sony.co.jp/cgi-bin/hyperrfc?rfc1321.txt
> 
> % It is conjectured that it is computationally infeasible to produce
> % two messages having the same message digest, or to produce any
> % message having a given prespecified target message digest.
> 
> You can be assured that using MD5, there is ZERO possibility CVS
> could be fooled into thinking the file hadn't changed, if it
> actually had.
>
> > it is not only possible but necessary that many files
> > have the same checksum (otherwise you'd have one really great
> > compression method).
> 
> Compression methods are reversible:  You have to be able
> to get out what you put in.  Digests such as MD5 have no such
> requirement, so please don't compare the two.

Look at information theory for a moment. MD5 turns a file of arbitrary
size into a single number, which I believe is 128 bits. Regardless of what
the output size is, if there are only 2^128 possible outputs, then as
soon as you have 2^128 + 1 distinct inputs, two of them must share an output.
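To see the pigeonhole argument concretely, here is a toy sketch (my own example, nothing CVS actually does) that shrinks the digest to 8 bits, so the forced collision shows up after at most 257 inputs instead of 2^128 + 1:

```python
import hashlib

def tiny_digest(data: bytes) -> int:
    # Keep only the first byte of the real MD5 digest: a 256-value "hash".
    return hashlib.md5(data).digest()[0]

seen = {}
collision = None
for i in range(257):
    msg = f"message-{i}".encode()
    d = tiny_digest(msg)
    if d in seen:
        # Two distinct inputs landed on the same 8-bit output.
        collision = (seen[d], msg, d)
        break
    seen[d] = msg

# 257 distinct inputs, only 256 possible outputs: the pigeonhole
# principle guarantees the loop breaks before it finishes.
assert collision is not None
print("collision:", collision)
```

The same counting works unchanged at 128 bits; the only difference is that 2^128 + 1 inputs is not a loop you can run.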

Now, it may be that, given an original message, it is hard to produce
another message with the same output, and it may be hard to produce any
message with a predetermined output; but the bottom line is that it must be
possible. The fact that you have a one-way function, one that does NOT need
to map back to the original, means you may not be able to get back,
precisely because more than one original maps to the same digest.

Think of it this way:
ASCII uses only about 96 or so of the 256 possible byte values. Source code
compresses to about 1/4 or 1/5 of the original (with a really good
compression system) because
A) it's ASCII (a savings right off the top, since most byte values never appear)
B) it's repetitive -- an 'f' is often followed by 'o', and 'fo' is almost
always followed by 'r', so the 'for' sequence encodes much less than
three bytes worth of information. A very long identifier
(EODatabaseDataSource, NSMutableDictionary, etc.) can be encoded down to a
single token. Etc.
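As a rough sanity check of those ratios (they vary a lot by file and by compressor), you can compress some source-like ASCII text with zlib; note this toy input is far more repetitive than real code, so it does much better than the 1/4 quoted above:

```python
import zlib

# Repetitive, ASCII-only "source code" -- a deliberately easy input.
source = (
    "for (i = 0; i < n; i++) {\n"
    "    total += values[i];\n"
    "}\n"
) * 200

raw = source.encode("ascii")
packed = zlib.compress(raw, level=9)
print(len(raw), "->", len(packed), "ratio", len(packed) / len(raw))
```

Because compression is reversible, `zlib.decompress(packed)` gives back the exact input, which is precisely the property a digest does not need.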

The ideal compression program will determine how much actual information,
not repetition, is in the source, and output only that. That is the
smallest non-duplicating output you can get.
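That "actual information" has a precise name: the Shannon entropy, the "sum of p log p" in the verse below. A quick way to measure it for a byte string (my sketch, counting single-byte symbol frequencies only, so it ignores the repetition savings):

```python
import math
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    # Shannon entropy: -sum(p * log2(p)) over observed byte frequencies.
    # No lossless code can average fewer bits per symbol than this.
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

text = b"the ideal compressor outputs only the real information"
print(entropy_bits_per_byte(text))  # well under 8 bits per byte
```

A file of uniformly random bytes scores close to 8 bits per byte, which is why random data does not compress.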

Huffman coding ("average length was short as it could be./ Huffman puffed
and puffed until his face turned blue,/ but he still couldn't beat the
sum of p log p") does the best you can with byte-length input tokens.
compress (LZW) does the best you can by building up long tokens based on
what tokens have been seen before. Bzip's method I understood less well;
it sorts blocks of input text (the Burrows-Wheeler transform) to bring
repetitions together. Theoretically, a better system could read the entire
file, scan it for repeated tokens, and then output a Huffman-style coding of
those token strings, but it would take a very long time to determine what
those ideal tokens are -- different programs try different ways to
approach that. (Note that spaces are not necessarily the ideal token
delimiter -- "<tab>for (i=0; i< " may be a token, as well as ";
i++)<CR><tab>{<CR><tab><tab>".)
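Here is a minimal Huffman coder (my own sketch, not the code of any particular tool) that shows the bound from the verse: the average code length per symbol can approach, but never beat, the entropy sum:

```python
import heapq
import math
from collections import Counter

def huffman_code_lengths(data: bytes) -> dict:
    """Return {byte: code length in bits} for an optimal Huffman code."""
    counts = Counter(data)
    # Heap entries: (frequency, tiebreaker, {symbol: depth so far}).
    heap = [(c, i, {sym: 0}) for i, (sym, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, l1 = heapq.heappop(heap)
        c2, _, l2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one bit deeper.
        merged = {s: d + 1 for s, d in {**l1, **l2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]

data = b"fo for for form fort forest"
lengths = huffman_code_lengths(data)
counts = Counter(data)
n = len(data)
avg = sum(counts[s] * lengths[s] for s in counts) / n
entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
print(avg, entropy)  # avg >= entropy, and within 1 bit of it
```

LZW and the block-sorting methods win over plain Huffman only by changing what counts as a symbol, not by beating this bound.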

Any fixed-length system must have duplicates. You cannot take a file that
is more than 128 bits (16 bytes) of random data (dd if=/dev/random bs=16
count=1) and turn it into a 128-bit number without duplication -- try
it 2^128 + 1 times, and tell me if you duplicate or not.
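If you want to actually run that experiment at a scale that finishes, truncate the digest first. In this sketch (my own, keeping a 16-bit "MD5"), a genuine collision turns up after a few hundred inputs thanks to the birthday effect, long before the 2^16 + 1 inputs the pigeonhole alone would need:

```python
import hashlib
from itertools import count

def md5_16(data: bytes) -> bytes:
    # A 16-bit "digest": the first two bytes of real MD5.
    return hashlib.md5(data).digest()[:2]

seen = {}
for i in count():
    msg = str(i).encode()
    d = md5_16(msg)
    if d in seen:
        # Two distinct inputs with the same truncated digest.
        print("collision:", seen[d], msg, d.hex())
        break
    seen[d] = msg
```

At the full 128 bits the same search would expect on the order of 2^64 tries, which is why no one has stumbled on an MD5 collision by accident.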

MD5 *must* duplicate. It may never duplicate in practice; it may never
duplicate over the life of a single project. But if you are designing
aircraft software, you must be able to say 'we need to check every byte
for changes'.

Bottom line:
MD5 checking can/should be switch-enableable, default on.
Timestamp checking can/should be switch-enableable, default on.
Always check every byte can/should be switch-enableable, default off
(it is needed if you are in an environment where timestamps can be
changed or mis-maintained, such as an NT 4.0 SP6 workstation box
smb-mounted onto a Linux system, as I have -- timestamps on files there
are not reliable, but at least are (or seem to be) consistent).
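In outline, the three checks might look like this (a sketch; the function names and the "recorded" arguments are mine, not CVS's). Each one answers "does the working file look unchanged?", at increasing cost and increasing certainty:

```python
import hashlib
import os

def same_mtime(path: str, recorded_mtime: float) -> bool:
    # Cheapest check; fooled by anything that rewrites timestamps.
    return os.stat(path).st_mtime == recorded_mtime

def same_md5(path: str, recorded_md5: str) -> bool:
    # Reads every byte once; a false "unchanged" needs an MD5 collision.
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest() == recorded_md5

def same_bytes(path: str, recorded_path: str) -> bool:
    # The only check with literally zero false "unchanged" verdicts,
    # at the cost of keeping and reading a full second copy.
    with open(path, "rb") as a, open(recorded_path, "rb") as b:
        return a.read() == b.read()
```

The byte-for-byte check is the "aircraft software" option: it needs the pristine copy on hand and reads both files in full, which is exactly why it should be off by default.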
