Martin McClure <[email protected]> writes:

> Where a hash comes in is if you want the identifiers generated in
> different places to be the *same* if the content being identified is
> the same -- you hash the content, and the resulting hash is the
> identifier. If the identifiers must also be unique, it's important to
> use a strong cryptographic hash. These are designed so that you can't
> get collisions even if you know how they work and try really hard, so
> they have good uniqueness properties.
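
(To make the quoted scheme concrete, here is a minimal sketch of content
hashing. The function name content_id is my own, and SHA-256 is just one
choice of strong cryptographic hash; nothing above prescribes it.)

```python
import hashlib

def content_id(data: bytes) -> str:
    """Derive an identifier purely from the content (hypothetical helper).

    Anyone holding the same bytes computes the same identifier,
    without coordinating with anyone else.
    """
    return hashlib.sha256(data).hexdigest()

# Two parties hashing identical bytes independently get the same identifier.
id_here = content_id(b"the same content")
id_there = content_id(b"the same content")
# A change of a single byte gives an unrelated identifier.
id_other = content_id(b"the same content!")
```

Note that this is the purely syntactic case: any change to the bytes, even
one that leaves the 'meaning' alone, produces a different identifier.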

There is another alternative, which is to use hash functions with high
continuity[1] and embrace collisions. For example, the sparse
distributed representations used by NuPIC[2] are basically hashes which
attempt to work semantically (interpreting the data) rather than
syntactically (treating the data as one huge int).

This makes it likely that values with colliding hashes have the same
'meaning', and those with similar hashes (e.g. low edit distance) will
have similar 'meaning'. This could overcome issues like re-encoding
audio mentioned in the previous thread.
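
As a rough illustration of 'similar input, similar hash', here is a small
sketch of a SimHash-style fingerprint over words. To be clear, this is not
what NuPIC does: it is a much cruder, swapped-in technique, and it only sees
a bag of lowercased words rather than any real 'meaning'. The names simhash
and hamming are my own, and BLAKE2b is an arbitrary choice of per-word hash.

```python
import hashlib

BITS = 64  # width of the fingerprint (an arbitrary choice for this sketch)

def simhash(text: str) -> int:
    """SimHash-style fingerprint of a text, treated as a bag of words."""
    counts = [0] * BITS
    for token in text.lower().split():
        # Hash each word to 64 bits; blake2b lets us ask for 8 bytes directly.
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each word votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the votes for it are positive.
    return sum(1 << i for i in range(BITS) if counts[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Texts that share most of their words cast mostly the same votes, so their
fingerprints tend to differ in few bits, and texts which differ only in word
order or letter case collide outright. That collision is 'embraced' here, but
only because this particular function was told to ignore order and case; it
knows nothing about audio encodings or file formats.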

The key to this approach is that it hand-waves all of the complexity
into a magical hash function. In reality, hash functions which derive
meaning from data will be limited in what they can spot and will always
be domain-specific.

This even applies to human senses: given two arbitrary files, we can
only compare them in a limited number of ways. When our statistical
tests can't spot similarities we might try sending them to ImageMagick
in case they're images of the same object, or to VLC in case they're
different encodings of the same audio, etc., but we'll always miss
something. For example, they might turn out to be the same text saved
in OOXML and ODF formats. Of course, these examples assume
that the files are complete, valid files in some particular format;
if we only have a fraction of a complete file, we're out of luck with
these tools.

[1] http://en.wikipedia.org/wiki/Hash_function#Continuity
[2] http://numenta.org/

Cheers,
Chris
_______________________________________________
fonc mailing list
[email protected]
http://vpri.org/mailman/listinfo/fonc
