Re: [fonc] hashes as names
Hi, ZFS has de-duplication built on top of SHA256 hashes. If the verify option is also enabled, it is possible for ZFS to work around detect hash collisions (although this option slows things down further). But ZFS can be considered as a kind of 'central authority' so its de-duplication scheme may not work in a distributed setting. Regards, Robbert. On Sep 27, 2013, at 2:50 AM, Wolfgang Eder e...@generalmagic.at wrote: hi, in recent discussions on this list, the idea of using hashes to identify or even name things is often mentioned. in this context, hashes are treated as being unique; albeit unlikely, it *is* possible that hashes are equal for two distinct things. are there ideas about how to handle such a situation? thanks and kind regards wolfgang ___ fonc mailing list fonc@vpri.org http://vpri.org/mailman/listinfo/fonc ___ fonc mailing list fonc@vpri.org http://vpri.org/mailman/listinfo/fonc
Re: [fonc] hashes as names
Martin McClure martin.mccl...@gemtalksystems.com writes: Where a hash comes in is if you want the identifiers generated in different places to be the *same* if the content being identified is the same -- you hash the content, and the resulting hash is the identifier. If the identifiers must also be unique, it's important to use a strong cryptographic hash. These are designed so that you can't get collisions even if you know how they work and try really hard, so they have good uniqueness properties. There is another alternative, which is to use hash functions with high continuity[1] and embrace collisions. For example, the sparse distributed representations used by NuPIC[2] are basically hashes which attempt to work semantically (interpreting the data) rather than syntactically (treating the data as one huge int). This makes it likely that values with colliding hashes have the same 'meaning', and those with similar hashes (eg. low edit distance) will have similar 'meaning'. This could overcome issues like re-encoding audio mentioned in the previous thread. The key to this approach is that it hand-waves all of the complexity into a magical hash function. In reality, hash functions which derive meaning from data will be limited in what they can spot and will always be domain-specific. This even applies to human senses: given two arbitrary files, we can only compare them in a limited number of ways. When our statistical tests can't spot similarities we might try sending them to imagemagick in case they're images of the same object, we might send them to VLC in case they're different encodings of the same audio, etc. but we'll always miss something. For example they might be turn out to be the same text saved in OOXML and ODF formats. Of course, these examples assume that the files are complete, valid files in some particular format; if we only have a fraction of a complete file, we're out of luck with these tools. [1] http://en.wikipedia.org/wiki/Hash_function#Continuity [2] http://numenta.org/ Cheers, Chris ___ fonc mailing list fonc@vpri.org http://vpri.org/mailman/listinfo/fonc
Re: [fonc] hashes as names
The usual problem with this sort of handle this super-rare event when it happens code is that it is poorly tested in practice. Should you trust it? On Thu, Sep 26, 2013 at 11:46 PM, Robbert van Dalen robbert.van.da...@gmail.com wrote: Hi, ZFS has de-duplication built on top of SHA256 hashes. If the verify option is also enabled, it is possible for ZFS to work around detect hash collisions (although this option slows things down further). But ZFS can be considered as a kind of 'central authority' so its de-duplication scheme may not work in a distributed setting. Regards, Robbert. On Sep 27, 2013, at 2:50 AM, Wolfgang Eder e...@generalmagic.at wrote: hi, in recent discussions on this list, the idea of using hashes to identify or even name things is often mentioned. in this context, hashes are treated as being unique; albeit unlikely, it *is* possible that hashes are equal for two distinct things. are there ideas about how to handle such a situation? thanks and kind regards wolfgang ___ fonc mailing list fonc@vpri.org http://vpri.org/mailman/listinfo/fonc ___ fonc mailing list fonc@vpri.org http://vpri.org/mailman/listinfo/fonc ___ fonc mailing list fonc@vpri.org http://vpri.org/mailman/listinfo/fonc
Re: [fonc] hashes as names
I love these sorts of hashes. I call them fingerprints as in audio fingerprints or visual fingerprints. They're extremely useful for augmented reality applications, for physical security applications, and many others. Even better if fingerprints from multiple sources can be combined. Aliasing is always a major issue - e.g. these hashes should also be robust to simple translation and motion (e.g. sounds are slightly different if you're moving towards or away; faces are slightly different when caught at an angle). It seems to me that ML-based approaches to turning features into vectors are among the better ways to approach these hashes. On Fri, Sep 27, 2013 at 2:21 AM, Chris Warburton chriswa...@googlemail.comwrote: Martin McClure martin.mccl...@gemtalksystems.com writes: Where a hash comes in is if you want the identifiers generated in different places to be the *same* if the content being identified is the same -- you hash the content, and the resulting hash is the identifier. If the identifiers must also be unique, it's important to use a strong cryptographic hash. These are designed so that you can't get collisions even if you know how they work and try really hard, so they have good uniqueness properties. There is another alternative, which is to use hash functions with high continuity[1] and embrace collisions. For example, the sparse distributed representations used by NuPIC[2] are basically hashes which attempt to work semantically (interpreting the data) rather than syntactically (treating the data as one huge int). This makes it likely that values with colliding hashes have the same 'meaning', and those with similar hashes (eg. low edit distance) will have similar 'meaning'. This could overcome issues like re-encoding audio mentioned in the previous thread. The key to this approach is that it hand-waves all of the complexity into a magical hash function. In reality, hash functions which derive meaning from data will be limited in what they can spot and will always be domain-specific. This even applies to human senses: given two arbitrary files, we can only compare them in a limited number of ways. When our statistical tests can't spot similarities we might try sending them to imagemagick in case they're images of the same object, we might send them to VLC in case they're different encodings of the same audio, etc. but we'll always miss something. For example they might be turn out to be the same text saved in OOXML and ODF formats. Of course, these examples assume that the files are complete, valid files in some particular format; if we only have a fraction of a complete file, we're out of luck with these tools. [1] http://en.wikipedia.org/wiki/Hash_function#Continuity [2] http://numenta.org/ Cheers, Chris ___ fonc mailing list fonc@vpri.org http://vpri.org/mailman/listinfo/fonc ___ fonc mailing list fonc@vpri.org http://vpri.org/mailman/listinfo/fonc
Re: [fonc] hashes as names
On Thu, Sep 26, 2013 at 5:50 PM, Wolfgang Eder e...@generalmagic.at wrote: hi, in recent discussions on this list, the idea of using hashes to identify or even name things is often mentioned. in this context, hashes are treated as being unique; albeit unlikely, it *is* possible that hashes are equal for two distinct things. are there ideas about how to handle such a situation? thanks and kind regards wolfgang I would consider the correct response to a hash collision to be roughly thus: 1) publish your collision far and wide 2) break out the champagne 3) celebrate having lived to see an event that needn't have ever happened, and surely never will again. ___ fonc mailing list fonc@vpri.org http://vpri.org/mailman/listinfo/fonc
Re: [fonc] hashes as names
I believe such code can be tested in practice with great confidence. If I would test such kind of code, I would replace the SHA256 code with a rogue version that emits equal hashes for certain bit patterns. As a side note, I also don't trust my macbook's 16 gig of internal computer memory - there is a (rather big) chance that bit-errors will silently corrupt state. Even EEC memory suffers that fate (but of course with a much lower chance). It is impossible to build a system that achieves 100% data correctness: SHA256 will do fine for now. On Sep 27, 2013, at 4:37 PM, David Barbour dmbarb...@gmail.com wrote: The usual problem with this sort of handle this super-rare event when it happens code is that it is poorly tested in practice. Should you trust it? On Thu, Sep 26, 2013 at 11:46 PM, Robbert van Dalen robbert.van.da...@gmail.com wrote: Hi, ZFS has de-duplication built on top of SHA256 hashes. If the verify option is also enabled, it is possible for ZFS to work around detect hash collisions (although this option slows things down further). But ZFS can be considered as a kind of 'central authority' so its de-duplication scheme may not work in a distributed setting. Regards, Robbert. On Sep 27, 2013, at 2:50 AM, Wolfgang Eder e...@generalmagic.at wrote: hi, in recent discussions on this list, the idea of using hashes to identify or even name things is often mentioned. in this context, hashes are treated as being unique; albeit unlikely, it *is* possible that hashes are equal for two distinct things. are there ideas about how to handle such a situation? thanks and kind regards wolfgang ___ fonc mailing list fonc@vpri.org http://vpri.org/mailman/listinfo/fonc ___ fonc mailing list fonc@vpri.org http://vpri.org/mailman/listinfo/fonc ___ fonc mailing list fonc@vpri.org http://vpri.org/mailman/listinfo/fonc ___ fonc mailing list fonc@vpri.org http://vpri.org/mailman/listinfo/fonc
Re: [fonc] hashes as names
You can test in a test environment, sure. But I think that doesn't qualify as in practice, unless you leave your rogue code in. I agree that 256-bit hash is unlikely to have accidental collisions compared to other causes of error (see my first message on this subject). I don't really trust it has a full 256-bits of security against intentional collisions; almost every 'secure' hash has been at least partially compromised. On Fri, Sep 27, 2013 at 11:33 AM, Robbert van Dalen robbert.van.da...@gmail.com wrote: I believe such code can be tested in practice with great confidence. If I would test such kind of code, I would replace the SHA256 code with a rogue version that emits equal hashes for certain bit patterns. As a side note, I also don't trust my macbook's 16 gig of internal computer memory - there is a (rather big) chance that bit-errors will silently corrupt state. Even EEC memory suffers that fate (but of course with a much lower chance). It is impossible to build a system that achieves 100% data correctness: SHA256 will do fine for now. On Sep 27, 2013, at 4:37 PM, David Barbour dmbarb...@gmail.com wrote: The usual problem with this sort of handle this super-rare event when it happens code is that it is poorly tested in practice. Should you trust it? On Thu, Sep 26, 2013 at 11:46 PM, Robbert van Dalen robbert.van.da...@gmail.com wrote: Hi, ZFS has de-duplication built on top of SHA256 hashes. If the verify option is also enabled, it is possible for ZFS to work around detect hash collisions (although this option slows things down further). But ZFS can be considered as a kind of 'central authority' so its de-duplication scheme may not work in a distributed setting. Regards, Robbert. On Sep 27, 2013, at 2:50 AM, Wolfgang Eder e...@generalmagic.at wrote: hi, in recent discussions on this list, the idea of using hashes to identify or even name things is often mentioned. in this context, hashes are treated as being unique; albeit unlikely, it *is* possible that hashes are equal for two distinct things. are there ideas about how to handle such a situation? thanks and kind regards wolfgang ___ ___ fonc mailing list fonc@vpri.org http://vpri.org/mailman/listinfo/fonc