I love these sorts of hashes. I call them "fingerprints" as in "audio
fingerprints" or "visual fingerprints".  They're extremely useful for
augmented reality applications, for physical security applications, and
many others. Even better if fingerprints from multiple sources can be
combined.

Aliasing is always a major issue: these hashes should be robust to simple
translation and motion (e.g. sounds are slightly different if you're moving
towards or away from the source; faces look slightly different when caught
at an angle). It seems to me that ML-based approaches to turning features
into vectors are among the better ways to approach these hashes.
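
To make the translation point concrete, here is a minimal sketch in Python.
It is not an ML-based approach and not any particular library's method; it
only illustrates one classical trick, under the assumption that the
"translation" is a circular shift of a 1-D signal. The magnitude of a
discrete Fourier transform does not change under such a shift, so a
fingerprint built from those magnitudes should not change either. The names
dft_magnitudes, fingerprint and hamming are hypothetical helpers made up for
this example, and the test signals are random numbers, not real audio.

```python
import cmath
import random
import statistics


def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform (naive O(n^2) version)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(1, n // 2 + 1)  # skip the DC term, keep positive frequencies
    ]


def fingerprint(signal):
    """One bit per frequency bin: is its magnitude above the median?"""
    mags = dft_magnitudes(signal)
    cutoff = statistics.median(mags)
    return [1 if m > cutoff else 0 for m in mags]


def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)  # fixed seed so the run is repeatable
original = [rng.uniform(-1, 1) for _ in range(64)]
shifted = original[10:] + original[:10]  # same content, circularly translated
unrelated = [rng.uniform(-1, 1) for _ in range(64)]

print(hamming(fingerprint(original), fingerprint(shifted)))
print(hamming(fingerprint(original), fingerprint(unrelated)))
```

The first distance should be zero or very close to it, and the second should
be much larger. This handles only circular shifts; Doppler-style changes or
a face seen at an angle need a richer feature extractor, which is where the
ML-based approaches come in.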


On Fri, Sep 27, 2013 at 2:21 AM, Chris Warburton
<[email protected]> wrote:

> Martin McClure <[email protected]> writes:
>
> > Where a hash comes in is if you want the identifiers generated in
> > different places to be the *same* if the content being identified is
> > the same -- you hash the content, and the resulting hash is the
> > identifier. If the identifiers must also be unique, it's important to
> > use a strong cryptographic hash. These are designed so that you can't
> > get collisions even if you know how they work and try really hard, so
> > they have good uniqueness properties.
>
> There is another alternative, which is to use hash functions with high
> continuity[1] and embrace collisions. For example, the sparse
> distributed representations used by NuPIC[2] are basically hashes which
> attempt to work semantically (interpreting the data) rather than
> syntactically (treating the data as one huge int).
>
> This makes it likely that values with colliding hashes have the same
> 'meaning', and those with similar hashes (eg. low edit distance) will
> have similar 'meaning'. This could overcome issues like re-encoding
> audio mentioned in the previous thread.
>
> The key to this approach is that it hand-waves all of the complexity
> into a magical hash function. In reality, hash functions which derive
> meaning from data will be limited in what they can spot and will always
> be domain-specific.
>
> This even applies to human senses: given two arbitrary files, we can
> only compare them in a limited number of ways. When our statistical
> tests can't spot similarities we might try sending them to imagemagick
> in case they're images of the same object, we might send them to VLC in
> case they're different encodings of the same audio, etc. but we'll
> always miss something. For example they might turn out to be the same
> text saved in OOXML and ODF formats. Of course, these examples assume
> that the files are complete, valid files in some particular format;
> if we only have a fraction of a complete file, we're out of luck with
> these tools.
>
> [1] http://en.wikipedia.org/wiki/Hash_function#Continuity
> [2] http://numenta.org/
>
> Cheers,
> Chris
> _______________________________________________
> fonc mailing list
> [email protected]
> http://vpri.org/mailman/listinfo/fonc
>
