Re: [fonc] hashes as names

2013-09-27 Thread Robbert van Dalen
Hi,

ZFS has de-duplication built on top of SHA256 hashes.
If the verify option is also enabled, it is possible for ZFS to work around 
detect hash collisions (although this option slows things down further).

But ZFS can be considered as a kind of 'central authority' so its 
de-duplication scheme may not work in a distributed setting.

Regards,
Robbert.

On Sep 27, 2013, at 2:50 AM, Wolfgang Eder e...@generalmagic.at wrote:

 hi,
 in recent discussions on this list, the idea of using hashes to
 identify or even name things is often mentioned.
 in this context, hashes are treated as being unique;
 
 albeit unlikely, it *is* possible that hashes are equal
 for two distinct things. are there ideas about
 how to handle such a situation?
 
 thanks and kind regards
 wolfgang
 ___
 fonc mailing list
 fonc@vpri.org
 http://vpri.org/mailman/listinfo/fonc

___
fonc mailing list
fonc@vpri.org
http://vpri.org/mailman/listinfo/fonc


Re: [fonc] hashes as names

2013-09-27 Thread Chris Warburton
Martin McClure martin.mccl...@gemtalksystems.com writes:

 Where a hash comes in is if you want the identifiers generated in
 different places to be the *same* if the content being identified is
 the same -- you hash the content, and the resulting hash is the
 identifier. If the identifiers must also be unique, it's important to
 use a strong cryptographic hash. These are designed so that you can't
 get collisions even if you know how they work and try really hard, so
 they have good uniqueness properties.

There is another alternative, which is to use hash functions with high
continuity[1] and embrace collisions. For example, the sparse
distributed representations used by NuPIC[2] are basically hashes which
attempt to work semantically (interpreting the data) rather than
syntactically (treating the data as one huge int).

This makes it likely that values with colliding hashes have the same
'meaning', and those with similar hashes (eg. low edit distance) will
have similar 'meaning'. This could overcome issues like re-encoding
audio mentioned in the previous thread.

The key to this approach is that it hand-waves all of the complexity
into a magical hash function. In reality, hash functions which derive
meaning from data will be limited in what they can spot and will always
be domain-specific.

This even applies to human senses: given two arbitrary files, we can
only compare them in a limited number of ways. When our statistical
tests can't spot similarities we might try sending them to imagemagick
in case they're images of the same object, we might send them to VLC in
case they're different encodings of the same audio, etc. but we'll
always miss something. For example they might be turn out to be the same
text saved in OOXML and ODF formats. Of course, these examples assume
that the files are complete, valid files in some particular format;
if we only have a fraction of a complete file, we're out of luck with
these tools.

[1] http://en.wikipedia.org/wiki/Hash_function#Continuity
[2] http://numenta.org/

Cheers,
Chris
___
fonc mailing list
fonc@vpri.org
http://vpri.org/mailman/listinfo/fonc


Re: [fonc] hashes as names

2013-09-27 Thread David Barbour
The usual problem with this sort of handle this super-rare event when it
happens code is that it is poorly tested in practice. Should you trust it?


On Thu, Sep 26, 2013 at 11:46 PM, Robbert van Dalen 
robbert.van.da...@gmail.com wrote:

 Hi,

 ZFS has de-duplication built on top of SHA256 hashes.
 If the verify option is also enabled, it is possible for ZFS to work
 around detect hash collisions (although this option slows things down
 further).

 But ZFS can be considered as a kind of 'central authority' so its
 de-duplication scheme may not work in a distributed setting.

 Regards,
 Robbert.

 On Sep 27, 2013, at 2:50 AM, Wolfgang Eder e...@generalmagic.at wrote:

  hi,
  in recent discussions on this list, the idea of using hashes to
  identify or even name things is often mentioned.
  in this context, hashes are treated as being unique;
 
  albeit unlikely, it *is* possible that hashes are equal
  for two distinct things. are there ideas about
  how to handle such a situation?
 
  thanks and kind regards
  wolfgang
  ___
  fonc mailing list
  fonc@vpri.org
  http://vpri.org/mailman/listinfo/fonc

 ___
 fonc mailing list
 fonc@vpri.org
 http://vpri.org/mailman/listinfo/fonc

___
fonc mailing list
fonc@vpri.org
http://vpri.org/mailman/listinfo/fonc


Re: [fonc] hashes as names

2013-09-27 Thread David Barbour
I love these sorts of hashes. I call them fingerprints as in audio
fingerprints or visual fingerprints.  They're extremely useful for
augmented reality applications, for physical security applications, and
many others. Even better if fingerprints from multiple sources can be
combined.

Aliasing is always a major issue - e.g. these hashes should also be robust
to simple translation and motion (e.g. sounds are slightly different if
you're moving towards or away; faces are slightly different when caught at
an angle). It seems to me that ML-based approaches to turning features into
vectors are among the better ways to approach these hashes.


On Fri, Sep 27, 2013 at 2:21 AM, Chris Warburton
chriswa...@googlemail.comwrote:

 Martin McClure martin.mccl...@gemtalksystems.com writes:

  Where a hash comes in is if you want the identifiers generated in
  different places to be the *same* if the content being identified is
  the same -- you hash the content, and the resulting hash is the
  identifier. If the identifiers must also be unique, it's important to
  use a strong cryptographic hash. These are designed so that you can't
  get collisions even if you know how they work and try really hard, so
  they have good uniqueness properties.

 There is another alternative, which is to use hash functions with high
 continuity[1] and embrace collisions. For example, the sparse
 distributed representations used by NuPIC[2] are basically hashes which
 attempt to work semantically (interpreting the data) rather than
 syntactically (treating the data as one huge int).

 This makes it likely that values with colliding hashes have the same
 'meaning', and those with similar hashes (eg. low edit distance) will
 have similar 'meaning'. This could overcome issues like re-encoding
 audio mentioned in the previous thread.

 The key to this approach is that it hand-waves all of the complexity
 into a magical hash function. In reality, hash functions which derive
 meaning from data will be limited in what they can spot and will always
 be domain-specific.

 This even applies to human senses: given two arbitrary files, we can
 only compare them in a limited number of ways. When our statistical
 tests can't spot similarities we might try sending them to imagemagick
 in case they're images of the same object, we might send them to VLC in
 case they're different encodings of the same audio, etc. but we'll
 always miss something. For example they might be turn out to be the same
 text saved in OOXML and ODF formats. Of course, these examples assume
 that the files are complete, valid files in some particular format;
 if we only have a fraction of a complete file, we're out of luck with
 these tools.

 [1] http://en.wikipedia.org/wiki/Hash_function#Continuity
 [2] http://numenta.org/

 Cheers,
 Chris
 ___
 fonc mailing list
 fonc@vpri.org
 http://vpri.org/mailman/listinfo/fonc

___
fonc mailing list
fonc@vpri.org
http://vpri.org/mailman/listinfo/fonc


Re: [fonc] hashes as names

2013-09-27 Thread Sam Putman
On Thu, Sep 26, 2013 at 5:50 PM, Wolfgang Eder e...@generalmagic.at wrote:

 hi,
 in recent discussions on this list, the idea of using hashes to
 identify or even name things is often mentioned.
 in this context, hashes are treated as being unique;

 albeit unlikely, it *is* possible that hashes are equal
 for two distinct things. are there ideas about
 how to handle such a situation?

 thanks and kind regards
 wolfgang


I would consider the correct response to a hash collision to be roughly
thus:

1) publish your collision far and wide

2) break out the champagne

3) celebrate having lived to see an event that needn't have ever happened,
and surely never will again.
___
fonc mailing list
fonc@vpri.org
http://vpri.org/mailman/listinfo/fonc


Re: [fonc] hashes as names

2013-09-27 Thread Robbert van Dalen
I believe such code can be tested in practice with great confidence.
If I would test such kind of code, I would replace the SHA256 code with a rogue 
version that emits equal hashes for certain bit patterns.

As a side note, I also don't trust my macbook's 16 gig of internal computer 
memory - there is a (rather big) chance that bit-errors will silently corrupt 
state.
Even EEC memory suffers that fate (but of course with a much lower chance).

It is impossible to build a system that achieves 100% data correctness: SHA256 
will do fine for now.

On Sep 27, 2013, at 4:37 PM, David Barbour dmbarb...@gmail.com wrote:

 The usual problem with this sort of handle this super-rare event when it 
 happens code is that it is poorly tested in practice. Should you trust it?
 
 
 On Thu, Sep 26, 2013 at 11:46 PM, Robbert van Dalen 
 robbert.van.da...@gmail.com wrote:
 Hi,
 
 ZFS has de-duplication built on top of SHA256 hashes.
 If the verify option is also enabled, it is possible for ZFS to work around 
 detect hash collisions (although this option slows things down further).
 
 But ZFS can be considered as a kind of 'central authority' so its 
 de-duplication scheme may not work in a distributed setting.
 
 Regards,
 Robbert.
 
 On Sep 27, 2013, at 2:50 AM, Wolfgang Eder e...@generalmagic.at wrote:
 
  hi,
  in recent discussions on this list, the idea of using hashes to
  identify or even name things is often mentioned.
  in this context, hashes are treated as being unique;
 
  albeit unlikely, it *is* possible that hashes are equal
  for two distinct things. are there ideas about
  how to handle such a situation?
 
  thanks and kind regards
  wolfgang
  ___
  fonc mailing list
  fonc@vpri.org
  http://vpri.org/mailman/listinfo/fonc
 
 ___
 fonc mailing list
 fonc@vpri.org
 http://vpri.org/mailman/listinfo/fonc
 
 ___
 fonc mailing list
 fonc@vpri.org
 http://vpri.org/mailman/listinfo/fonc

___
fonc mailing list
fonc@vpri.org
http://vpri.org/mailman/listinfo/fonc


Re: [fonc] hashes as names

2013-09-27 Thread David Barbour
You can test in a test environment, sure. But I think that doesn't qualify
as in practice, unless you leave your rogue code in.

I agree that 256-bit hash is unlikely to have accidental collisions
compared to other causes of error (see my first message on this subject). I
don't really trust it has a full 256-bits of security against intentional
collisions; almost every 'secure' hash has been at least partially
compromised.



On Fri, Sep 27, 2013 at 11:33 AM, Robbert van Dalen 
robbert.van.da...@gmail.com wrote:

 I believe such code can be tested in practice with great confidence.
 If I would test such kind of code, I would replace the SHA256 code with a
 rogue version that emits equal hashes for certain bit patterns.

 As a side note, I also don't trust my macbook's 16 gig of internal
 computer memory - there is a (rather big) chance that bit-errors will
 silently corrupt state.
 Even EEC memory suffers that fate (but of course with a much lower chance).

 It is impossible to build a system that achieves 100% data correctness:
 SHA256 will do fine for now.

 On Sep 27, 2013, at 4:37 PM, David Barbour dmbarb...@gmail.com wrote:

  The usual problem with this sort of handle this super-rare event when
 it happens code is that it is poorly tested in practice. Should you trust
 it?
 
 
  On Thu, Sep 26, 2013 at 11:46 PM, Robbert van Dalen 
 robbert.van.da...@gmail.com wrote:
  Hi,
 
  ZFS has de-duplication built on top of SHA256 hashes.
  If the verify option is also enabled, it is possible for ZFS to work
 around detect hash collisions (although this option slows things down
 further).
 
  But ZFS can be considered as a kind of 'central authority' so its
 de-duplication scheme may not work in a distributed setting.
 
  Regards,
  Robbert.
 
  On Sep 27, 2013, at 2:50 AM, Wolfgang Eder e...@generalmagic.at wrote:
 
   hi,
   in recent discussions on this list, the idea of using hashes to
   identify or even name things is often mentioned.
   in this context, hashes are treated as being unique;
  
   albeit unlikely, it *is* possible that hashes are equal
   for two distinct things. are there ideas about
   how to handle such a situation?
  
   thanks and kind regards
   wolfgang
   ___

___
fonc mailing list
fonc@vpri.org
http://vpri.org/mailman/listinfo/fonc