Re: [Commons-l] Duplicate removal?

Jonas Öberg Thu, 04 Dec 2014 02:15:07 -0800

Hi James,

> They're very similar, though the smaller image is in fact sharper, a little
> darker, and slightly differently framed.


This wouldn't ring any bells for us. They're too different for us to
say, mathematically, that they are similar without also triggering a
lot of false positives.

If we look at the hashes generated by our blockhash[1] algorithm for
those two images, we end up with this:

8000bc409f7c9ffd9cd096689fe883e4f3fd83c583c101e183e101e60073e7bf
80019c819ff99ff18cc1944197e19fe9f7e9c3c983c103c1a3e183ee217004ff

You can see that there is some commonality, but that they're also
quite far apart. If we convert this to bits and calculate the hamming
distance (the number of bits that differ) between the two, we end up
with a distance of 48 bits (out of 256). So far, we've found that a
maximum distance of 10 bits is usually a safe threshold for calling
two images a match, though in the draft query for duplicate Commons
works that I linked to, I've been even more restrictive and not
allowed even 1 bit to differ, trading fewer matches for more reliable
ones.
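
As a rough sketch of that comparison, here is how the distance between
the two hashes above could be computed in Python. The function names
and the max_distance parameter are made up for illustration and are
not part of blockhash itself; the default of 10 is simply the
threshold mentioned above.

```python
# The two 256-bit hashes quoted above, as hex strings.
HASH_A = "8000bc409f7c9ffd9cd096689fe883e4f3fd83c583c101e183e101e60073e7bf"
HASH_B = "80019c819ff99ff18cc1944197e19fe9f7e9c3c983c103c1a3e183ee217004ff"


def hamming_distance(hex_a, hex_b):
    """Count the bits that differ between two equal-length hex hashes."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 in every position where the two hashes disagree.
    differing = int(hex_a, 16) ^ int(hex_b, 16)
    return bin(differing).count("1")


def is_match(hex_a, hex_b, max_distance=10):
    """Hypothetical helper: treat hashes within max_distance bits as a match."""
    return hamming_distance(hex_a, hex_b) <= max_distance


if __name__ == "__main__":
    print(hamming_distance(HASH_A, HASH_B))  # 48
    print(is_match(HASH_A, HASH_B))          # False
```

Setting max_distance=0 corresponds to the stricter rule used in the
draft query, where not even 1 bit is allowed to differ.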

Sincerely,
Jonas

_______________________________________________
Commons-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/commons-l
