Hi Fae,

> Listing identical duplicates with 2 or more files matching would be
> simpler but longer; at the moment I count 3,279 files like this on
> Commons which took over 9 minutes to run. :-)

This is very interesting. I had a closer look at our matches, and many of them seem to be files with slight color variations, or where the jpg has simply been compressed differently, so a sha1 wouldn't match them against each other. But that speaks in favor of having a human validate the matches we find case by case.

My Python script is still processing :-) but it has currently recorded 12,475 matches, which then also includes your 3,279. Your 3,279 should be fairly uncomplicated to do something about, it seems, though perhaps a human needs to assist there too, since the metadata and use may vary?

Sincerely,
Jonas

_______________________________________________
Commons-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/commons-l
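
To illustrate the sha1 point above, here is a minimal sketch of why a hash comparison only catches byte-identical files. It is not the script mentioned in this thread; the file names and the byte strings standing in for image data are hypothetical, and it assumes exact duplicates are found by grouping files on the SHA-1 digest of their content.

```python
from __future__ import annotations

import hashlib
from collections import defaultdict


def sha1_hex(data: bytes) -> str:
    """Return the hex SHA-1 digest of a file's raw bytes."""
    return hashlib.sha1(data).hexdigest()


def group_exact_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose content hashes to the same SHA-1 digest.

    Only groups with 2 or more members are returned.
    """
    by_digest: dict[str, list[str]] = defaultdict(list)
    for name, data in files.items():
        by_digest[sha1_hex(data)].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) >= 2]


# Hypothetical stand-ins for image bytes (not real JPEG data).
original = b"\xff\xd8" + b"pixel-data" * 100
files = {
    "A.jpg": original,
    "B.jpg": original,                 # byte-identical copy
    "C.jpg": original[:-1] + b"\x00",  # same "picture", one byte differs
}

print(group_exact_duplicates(files))  # [['A.jpg', 'B.jpg']]
```

A single differing byte, as a recompressed or color-adjusted jpg would have many times over, gives a completely different digest, so C.jpg is not grouped with the other two. Matches of that kind need some form of visual comparison and, as noted above, a human to validate them.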
