On Feb 17, 2011, at 07:56, John P. Hartmann wrote:
> Below is an EXEC I wrote to check whether the 32-bit CRC is really
> unique. As it does a SORT UNIQUE, there are limits to the size of
> files that can be checked.
>
This empirical approach is well supplemented by the theory:
http://en.wikipedia.org/wiki/Birthday_attack
In the simplest computation, for a 32-bit hash the probability of a
collision is already about 39% at 2^16 samples and reaches 50% at
roughly 77,000 samples. This ought to be well within reach of your test.
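For anyone who wants to check the arithmetic, here is a minimal sketch of the usual birthday approximation, p ~= 1 - exp(-n(n-1)/(2N)), for a hash with N = 2^32 equally likely values. The function names are my own, and the model assumes the CRC values behave like uniformly random 32-bit numbers, which real records may not. Under that assumption it gives about 39% at exactly 2^16 samples and about 77,000 samples for an even chance.

```python
import math


def collision_probability(samples: int, hash_bits: int = 32) -> float:
    """Approximate chance that at least two of `samples` hash values coincide.

    Assumes hash values are uniformly distributed over 2**hash_bits outcomes.
    """
    space = 2.0 ** hash_bits
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2N)).
    # -expm1(-x) computes 1 - exp(-x) without losing precision for small x.
    return -math.expm1(-samples * (samples - 1) / (2.0 * space))


def samples_for_probability(p: float, hash_bits: int = 32) -> int:
    """Approximate number of samples needed to reach collision probability p."""
    space = 2.0 ** hash_bits
    # Invert the approximation (using n**2 for n(n-1)): n ~= sqrt(2N ln(1/(1-p))).
    return math.ceil(math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p))))


if __name__ == "__main__":
    print(f"p(collision) at 2^16 samples: {collision_probability(2 ** 16):.3f}")
    print(f"samples for 50%: {samples_for_probability(0.5)}")
```

The same two functions can be called with hash_bits=128 or hash_bits=160 to see how quickly the numbers move out of reach for longer hashes, bearing in mind that this covers accidental collisions only and says nothing about deliberately constructed ones.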
I'm confident that MD5 is better than CRC-32, and MD5 is now deprecated.
SHA-1, etc. are better. Some Chinese mathematicians have a proof that
with about 2^69 carefully chosen samples a collision becomes likely in
SHA-1. I don't know that an actual collision has been exhibited.
Somewhere I saw IBM's description of its SuperC. IIRC, it begins
by computing a hash of each record.
Are you considering writing a compare stage? It would be difficult
to do better than, or even match, SuperC. But a few years ago I asked
about a compare stage. I need only an identical/different indication.
I'm using COMPARE MODULE (I think it's called).
-- gil