RE: [firebird-support] Re: detect duplicate blobs, how to?

2017-02-09 Thread 'Richard Damon' rich...@damon-family.org [firebird-support]
Yes, it is a matter of scale, but it is also a matter of what you promise. If 
you are going to promise 100% reliable detection of duplicates, just using a 
hash (no matter how good) is a bug, because it ISN’T 100% reliable. It might be 
“Reliable Enough”, but don’t say it is 100%.

 

Please also review your math, as your own numbers prove the flaw. The 
number 2.71 x 10^18 ISN’T a chance of collision, or even a chance of any kind. 
It is the number of random version-4 UUIDs (122 random bits of their 128) that 
would need to be generated to have a 50% chance of a collision. 
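For reference, that figure falls straight out of the standard birthday bound: 
about sqrt(2 * d * ln 2) random draws from a space of d equally likely values 
give a 50% chance of at least one collision. A quick Python check (my own 
sketch, not from the original posts):

import math

def draws_for_50pct_collision(random_bits: int) -> float:
    # Birthday bound: draws needed for a ~50% chance of a repeat among
    # 2**random_bits equally likely values.
    d = 2.0 ** random_bits
    return math.sqrt(2.0 * d * math.log(2.0))

print(f"{draws_for_50pct_collision(122):.2e}")  # ~2.71e+18 (version-4 UUIDs)
print(f"{draws_for_50pct_collision(256):.2e}")  # ~4.01e+38 (a 256-bit hash)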

 

I would actually argue that the idea of using a very high quality hash, like 
SHA256, is taking the problem in the wrong direction. It is a somewhat 
expensive value to compute, and at 256 bits (32 bytes) it is somewhat big to 
store, and index, for every file. I would be much more inclined to use a 
simple hash (like CRC) that gives maybe a 32-bit result, look up the files 
that match (somewhat likely to find one), and do the bit-by-bit compare to see 
if it is a duplicate. That is probably faster than your method of computing 
the SHA256 and assuming (possibly, even if unlikely, incorrectly) that the 
values are unique; the time saved by using a simpler hash, and a quicker 
index, likely makes up for the time needed to compare the two files.
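A minimal sketch of that scheme in Python (my own illustration; the in-memory 
dict stands in for an indexed 32-bit hash column in the database):

import zlib

# crc32 value -> blobs already stored under it. Cheap-hash collisions are
# expected and harmless: equality is decided by the byte-by-byte compare.
registry: dict[int, list[bytes]] = {}

def is_duplicate(blob: bytes) -> bool:
    crc = zlib.crc32(blob)
    for candidate in registry.get(crc, []):
        if candidate == blob:  # bit-by-bit compare, done only on a CRC match
            return True
    registry.setdefault(crc, []).append(blob)
    return False

assert not is_duplicate(b"first image")
assert is_duplicate(b"first image")       # exact duplicate caught
assert not is_duplicate(b"second image")  # different CRC, no compare needed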

 




RE: [firebird-support] Re: detect duplicate blobs, how to?

2017-02-09 Thread 'Leyne, Sean' s...@broadviewsoftware.com [firebird-support]
Richard,

> Sean
> Even SHA256 can’t eliminate all possibility of a duplicate. If you have
> files of more than 256 bits in them, by the pigeonhole principle, there WILL
> be duplicates within the universe of all possible files. There HAS to be.
> The probability is very low (but not zero) if you keep your set of files
> below 2^128 members, but it is NOT 0.

I agree that there is a possibility, but it is all about understanding scale.

Consider, 2^128 is the same number space that is used to represent UUID and 
GUID values, and the chance of a collision is 2.71 x 10^18 
(https://en.wikipedia.org/wiki/Universally_unique_identifier)

Consider, the number of words ever spoken by human beings = 5 Exabytes (< 2^63) 
(http://highscalability.com/blog/2012/9/11/how-big-is-a-petabyte-exabyte-zettabyte-or-a-yottabyte.html), 
so what are the chances that more than 2^128 files have been created?

Consider, the probability that a rogue asteroid crashes on Earth *within the 
next second*, obliterating civilization-as-we-know-it, is about 10^-15, which 
is 45 **orders of magnitude** more probable than a SHA-256 collision. 
(http://stackoverflow.com/questions/4014090/is-it-safe-to-ignore-the-possibility-of-sha-collisions-in-practice)
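As a back-of-the-envelope check of that scale (my own arithmetic, using the 
usual birthday approximation p ~ n^2 / 2^(b+1) for a b-bit hash):

# Probability of any SHA-256 collision among n random hashes, valid while
# the result is tiny. Even a trillion stored blobs barely registers.
n = 1e12                 # one trillion files
p = n * n / 2.0 ** 257   # birthday approximation for a 256-bit hash
print(f"{p:.1e}")        # ~4.3e-54, dozens of orders of magnitude below 1e-15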

IMO, the chances of a collision are for all practical purposes 0.

But if you still think there is a chance, then use SHA512.


Sean



RE: [firebird-support] Re: detect duplicate blobs, how to?

2017-02-09 Thread 'Richard Damon' rich...@damon-family.org [firebird-support]
 

 


Even SHA256 can’t eliminate all possibility of a duplicate. If you have files 
of more than 256 bits in them, by the pigeonhole principle, there WILL be 
duplicates within the universe of all possible files. There HAS to be. The 
probability is very low (but not zero) if you keep your set of files below 
2^128 members, but it is NOT 0.

The key property of a hash like SHA256 is that, given a hash value, you cannot 
create a file (other than by brute force) that will yield that hash value. 
When using a hash, you need to decide whether the chance of a false positive 
on a match is acceptable. With a good large hash, that probability gets very 
small, so maybe you can assume it is perfect. I would likely still compare the 
files to be sure, since you will likely only incur that cost if you do have a 
duplicate.
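The pigeonhole step is easy to make concrete (my own illustration, not part 
of the original message):

# More distinct 257-bit files exist than 256-bit hash values, so some two
# files must share a SHA-256 hash, no matter how good the hash is.
distinct_files  = 2 ** 257   # files of exactly 257 bits
distinct_hashes = 2 ** 256   # possible SHA-256 outputs
print(distinct_files > distinct_hashes)  # True: collisions must exist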



RE: [firebird-support] Re: detect duplicate blobs, how to?

2017-02-09 Thread 'Leyne, Sean' s...@broadviewsoftware.com [firebird-support]


> You are aware of course that you can't use any hashing function on its own to
> detect duplicates? - the best you can do is detect *probable* duplicates,

Actually, if you choose the right hash function you can detect duplicates.

If you create a UDF based on/using SHA256, the result would be unique (with a 
2^256 certainty) -- there is no known collision of a SHA256 hash 
(https://en.wikipedia.org/wiki/Hash_function_security_summary).
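For what such a UDF computes, here is the client-side equivalent in Python 
(an illustration of mine using the standard hashlib module; the helper name 
is hypothetical):

import hashlib
import io

def sha256_of_blob(stream, chunk_size: int = 64 * 1024) -> str:
    # Hash in chunks so memory use stays flat for arbitrarily large blobs.
    h = hashlib.sha256()
    while chunk := stream.read(chunk_size):
        h.update(chunk)
    return h.hexdigest()  # 64 hex chars (32 bytes) to store and index

print(sha256_of_blob(io.BytesIO(b"some blob content")))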


Sean



Re: [firebird-support] Re: detect duplicate blobs, how to?

2017-02-09 Thread Tim Ward t...@telensa.com [firebird-support]
You are aware of course that you can't use any hashing function on its 
own to detect duplicates? - the best you can do is detect *probable* 
duplicates, filtering out ones that are *definitely* different, in order 
to *reduce*, not eliminate, the amount of work you end up doing 
comparing byte by byte.
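Concretely, the shortlisting can be pushed into the server with Firebird's 
built-in HASH(), which per Dmitry's message below accepts blobs. A sketch of 
mine using the fdb Python driver; the connection details, table (images) and 
column (img_blob) are hypothetical:

import fdb

con = fdb.connect(dsn="localhost:/data/images.fdb",
                  user="SYSDBA", password="masterkey")
cur = con.cursor()
# Rows sharing a HASH() value are only *probable* duplicates; fetch and
# compare their contents byte by byte before treating them as identical.
cur.execute("""
    SELECT HASH(img_blob) AS h, COUNT(*) AS n
    FROM images
    GROUP BY 1
    HAVING COUNT(*) > 1
""")
for h, n in cur.fetchall():
    print(f"hash value {h}: {n} candidate rows to verify")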


--
Tim Ward



Re: [firebird-support] Re: detect duplicate blobs, how to?

2017-02-09 Thread hamacker sirhamac...@gmail.com [firebird-support]
Well, the manual says HASH() fully supports text BLOBs, so I assume binary 
BLOBs are not. If HASH() can handle ^Z and other control characters in image 
BLOBs, then I need to work out how to use this function to eliminate the 
duplicates.

Is there any function like CHECKSUM, CRC32, ...?

2017-02-09 10:36 GMT-02:00 Dmitry Yemanov dim...@users.sourceforge.net 
[firebird-support]:

>
>
> 09.02.2017 14:48, hamacker wrote:
> >
> > There is a HASH function but only for string.
>
> It works for blobs as well.
>
> Dmitry