On Tuesday, April 18, Larry Jones wrote:
> Tobias Weingartner writes:
> >
> > Unless you can point me at a definite article that explains the coding
> > and complexity theory behind using 2 different algorithms, and that
> > proves that it actually does reduce the chance of errors, I'm going to
> > say, that the *BEST* you can do, is as well as a single algorithm with
> > N+M bits worth of a sum. In other words, I'm sure I can replace the
> > 2 algorithms with 1 having the same "chance of errors" properties.
>
> That's true, but it's still much better than a single N-bit sum, and
> applying it to fixed-size blocks (which are presumably much smaller than
> the average file size) also reduces the chance of error significantly
> since you've significantly reduced the number of possible inputs.
I'm not so sure. Assuming the "sum" is done right, I seriously doubt
you would see much difference in how it behaves with different lengths
of input. In other words, there is a "complexity" trade-off that the
hashes are exploiting. In some sense, you can think of a hash as an
algorithmic way to turn low-complexity content (low information
content per bit) into "randomized" high-complexity content (high
information content per bit). Granted, unless we are talking
about lossless compression, that conversion tends to be one-way, and does
tend to lose some of the "information content" of the original.
However, this is where the design of the hash algorithm comes into play:
this loss is usually minimized -- usually far enough that the chance
of it actually happening is no worse than the chance of the rest of the
CVS software doing something else wrong... :-)
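To make the earlier point concrete -- that two independent sums buy you no
more than one sum of the combined width -- here is a small sketch. The
sizes (16+16 bits, i.e. hypothetical N = M = 16) are made up for
illustration, and it assumes ideal, uniformly distributed, independent
checksums, which real sums only approximate:

```python
def pairwise_collision(bits):
    """Chance that two random inputs share the same ideal 'bits'-bit sum."""
    return 1.0 / 2 ** bits

n, m = 16, 16  # hypothetical checksum widths

# Two independent sums: a pair of inputs must collide in BOTH at once,
# so the probabilities multiply...
p_two = pairwise_collision(n) * pairwise_collision(m)

# ...which is exactly the collision chance of one (n+m)-bit sum.
p_one = pairwise_collision(n + m)

assert p_two == p_one  # 2^-16 * 2^-16 == 2^-32
```

So for ideal sums the pair and the single wider sum have the same
"chance of errors" properties; any real difference comes from how far
each algorithm falls short of the ideal, not from using two of them.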
> > Also, the rsync algorithm does not help much here. It makes a lot of
> > sense in a "transmission" scenario. In a "checking" or "checksum"
> > scenario, it makes less sense. In some sense, checksums are meant to
> > be "fixed size" representations of a file. Quick to look up, quick to
> > manage, compare, etc.
>
> I guess I wasn't clear -- I meant that it should be used to send the
> (possibly) modified file to the server, not that it should be used
> simply to determine whether the file was suspected of changing.
Ahh, this side of the argument has been hashed (no pun intended) over
before. Yes, it would have some benefits; however, implementing the
rsync algorithm for CVS client-server communication would likely be
a rather large, and maybe even non-backwards-compatible, extension to
the current protocol. Of course, if someone wishes to prove me wrong,
I'd be delighted. :-)
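For anyone tempted to try: the heart of rsync is its weak rolling
checksum, which can slide a fixed-size window one byte at a time without
rescanning the window. Here is a sketch of that idea (not CVS code, and
the window size and sample data are arbitrary):

```python
MOD = 1 << 16  # both halves of the weak checksum are 16-bit sums

def weak_checksum(block):
    """Compute the (a, b) pair for a block from scratch."""
    a = b = 0
    n = len(block)
    for i, byte in enumerate(block):
        a = (a + byte) % MOD            # plain sum of the bytes
        b = (b + (n - i) * byte) % MOD  # position-weighted sum
    return a, b

def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window one byte: drop out_byte, take in in_byte."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - n * out_byte + a) % MOD
    return a, b

data = b"the quick brown fox jumps over the lazy dog"
n = 16
a, b = weak_checksum(data[0:n])
for k in range(1, len(data) - n + 1):
    a, b = roll(a, b, data[k - 1], data[k + n - 1], n)
    assert (a, b) == weak_checksum(data[k:k + n])  # rolling == rescanning
```

The O(1) roll step is what makes it cheap for the receiver to find
matching blocks at every byte offset; grafting that block-matching
exchange onto the existing CVS protocol is the hard part, not the
checksum itself.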
--Toby.