Martin McClure <[email protected]> writes:

> Where a hash comes in is if you want the identifiers generated in
> different places to be the *same* if the content being identified is
> the same -- you hash the content, and the resulting hash is the
> identifier. If the identifiers must also be unique, it's important to
> use a strong cryptographic hash. These are designed so that you can't
> get collisions even if you know how they work and try really hard, so
> they have good uniqueness properties.
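To make the quoted scheme concrete, here is a minimal sketch in Python. It assumes SHA-256 as the "strong cryptographic hash" (the quoted text doesn't name one), and content_id is just a name made up for illustration:

```python
import hashlib

def content_id(data: bytes) -> str:
    """Derive an identifier from the content alone (hypothetical helper).

    Assumption: SHA-256 stands in for the 'strong cryptographic hash'.
    """
    return hashlib.sha256(data).hexdigest()

# Two parties hashing the same bytes independently get the same identifier,
# while even a one-character change gives an unrelated one.
same_a = content_id(b"hello world")
same_b = content_id(b"hello world")
different = content_id(b"hello world!")
print(same_a == same_b)     # identical content, identical identifier
print(same_a == different)  # different content, different identifier
```

No coordination is needed between the places generating identifiers; the content itself is the only input.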
There is another alternative, which is to use hash functions with high continuity[1] and embrace collisions. For example, the sparse distributed representations used by NuPIC[2] are basically hashes which attempt to work semantically (interpreting the data) rather than syntactically (treating the data as one huge int). This makes it likely that values with colliding hashes have the same 'meaning', and those with similar hashes (e.g. low edit distance) will have similar 'meaning'. This could overcome issues like the re-encoding of audio mentioned in the previous thread.

The catch with this approach is that it hand-waves all of the complexity into a magical hash function. In reality, hash functions which derive meaning from data will be limited in what they can spot and will always be domain-specific.

This even applies to human senses: given two arbitrary files, we can only compare them in a limited number of ways. When our statistical tests can't spot similarities we might try sending them to ImageMagick in case they're images of the same object, we might send them to VLC in case they're different encodings of the same audio, etc., but we'll always miss something. For example, they might turn out to be the same text saved in OOXML and ODF formats. Of course, these examples assume that the files are complete, valid files in some particular format; if we only have a fraction of a complete file, we're out of luck with these tools.

[1] http://en.wikipedia.org/wiki/Hash_function#Continuity
[2] http://numenta.org/

Cheers,
Chris
_______________________________________________
fonc mailing list
[email protected]
http://vpri.org/mailman/listinfo/fonc
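As a rough illustration of the "high continuity" idea in the message above, here is a small Python sketch of SimHash over words. To be clear about the assumptions: SimHash is a different, much simpler technique than NuPIC's sparse distributed representations, it only understands whitespace-separated text (so it is domain-specific in exactly the sense described above), and the names token_hash, simhash, hamming and the 64-bit width are choices made for this sketch.

```python
import hashlib

BITS = 64  # fingerprint width; an arbitrary choice for this sketch

def token_hash(token: str) -> int:
    """Stable 64-bit hash of one token (an ordinary, non-continuous hash)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def simhash(text: str) -> int:
    """SimHash fingerprint over lowercased, whitespace-separated words."""
    counts = [0] * BITS
    for token in text.lower().split():
        h = token_hash(token)
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i is set when more tokens voted 1 than 0 at that position.
    return sum(1 << i for i in range(BITS) if counts[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")

# Reordering or re-casing the words is treated as "the same meaning",
# so these two inputs collide on purpose.
print(simhash("The quick brown fox") == simhash("fox brown quick the"))
```

Texts that share most of their words tend to get fingerprints a small Hamming distance apart, while unrelated texts typically differ in roughly half the bits. Anything this word-counting view can't see (synonyms, a different file format, a truncated file) is missed, which is the limitation discussed above.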
