Re: [Commons-l] Duplicate removal?

Jonas Öberg Thu, 04 Dec 2014 11:55:10 -0800

Hi Fae,

> Listing identical duplicates with 2 or more files matching would be
> simpler but longer; at the moment I count 3,279 files like this on
> Commons which took over 9 minutes to run. :-)


This is very interesting. I had a closer look at our matches, and many
of them seem to be files with slight color variations, or where the
JPEG has simply been compressed differently, so a SHA-1 wouldn't match
them against each other. But that supports the view that the matches
we find need a human to validate them case by case. My Python script
is still processing :-) but it has recorded 12,475 matches so far,
which also include your 3,279.
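
To illustrate the difference, here is a minimal Python sketch. It is
not the script mentioned above; the file names, byte strings, pixel
values and the toy average_hash function are all made up for the
example. It shows that SHA-1 only groups byte-identical files, while a
simple perceptual hash can still treat a slightly altered image as a
match.

```python
import hashlib
from collections import defaultdict


def group_by_sha1(files):
    """Group file names by the SHA-1 of their bytes (exact duplicates only)."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[hashlib.sha1(data).hexdigest()].append(name)
    # Keep only digests shared by two or more files.
    return [sorted(names) for names in groups.values() if len(names) > 1]


def average_hash(pixels):
    """Toy perceptual hash: one bit per pixel, set if brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical data: two byte-identical files and one differing by one byte.
files = {
    "a.jpg": b"same bytes",
    "b.jpg": b"same bytes",
    "c.jpg": b"same bytes.",
}
exact_groups = group_by_sha1(files)

# Hypothetical 2x2 grayscale images: the second stands in for a
# recompressed copy with slightly shifted pixel values.
original = [[10, 200], [220, 30]]
recompressed = [[12, 198], [223, 27]]
distance = hamming(average_hash(original), average_hash(recompressed))
```

Here c.jpg falls out of the SHA-1 grouping because of a single extra
byte, while the two pixel grids produce the same bit pattern. A real
perceptual hash would work on decoded, downscaled images and would
need a distance threshold, which is where the human validation comes
in.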

But your 3,279 should be fairly uncomplicated to do something about,
it seems, though perhaps there too a human needs to assist, since the
metadata and use may vary?


Sincerely,
Jonas

_______________________________________________
Commons-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/commons-l
