On 4 December 2014 at 18:39, Jonas Öberg <[email protected]> wrote:
>> * byte-for-byte identical
>
> That's something probably best done by WMF staff themselves, I think a
> simple md5 comparison would give quite a few matches. Doing on the WMF
> side would alleviate the need to transfer large amounts of data.

Volunteers can do this using simple database queries, which is a lot more efficient than pulling data out of the API. For example, while writing this email I knocked out a query to show all non-trivial images (>2 pixels wide) on Commons with at least *3* files having the same SHA1 checksum, showing each image just once. The matching files are listed at the bottom of each image page on Commons.

Interestingly, this shows that most of the 226 files came from an upload of Gospel illustrations. The low number seems reassuring considering the size of Commons. The files are reported in descending order of image resolution.

Report: http://commons.wikimedia.org/w/index.php?title=User:F%C3%A6/sandbox&oldid=141460887

On its own this is an interesting list to use as a backlog for fixes. Listing identical duplicates with 2 or more files matching would be simpler but longer; at the moment I count 3,279 files like this on Commons, a count that took over 9 minutes to run. :-)

Fae
--
[email protected] https://commons.wikimedia.org/wiki/User:Fae
_______________________________________________
Commons-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/commons-l
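
The query itself is not included in the message. As a rough sketch of the kind of grouping it describes, the Python below runs a comparable query against an in-memory SQLite table standing in for the Commons database. It assumes the `img_name`, `img_sha1`, `img_width` and `img_height` columns of MediaWiki's `image` table; the sample rows and the `find_duplicates` helper are invented for illustration and are not the author's query.

```python
import sqlite3

# Hypothetical sample rows. Column names follow MediaWiki's `image` table,
# but the file names, checksums and sizes are invented for this sketch.
ROWS = [
    ("Gospel_01.jpg", "aaa", 1200, 800),
    ("Gospel_01_copy.jpg", "aaa", 1200, 800),
    ("Gospel_01_dup.jpg", "aaa", 1200, 800),
    ("Map_a.png", "bbb", 4000, 3000),
    ("Map_b.png", "bbb", 4000, 3000),
    ("Map_c.png", "bbb", 4000, 3000),
    ("Map_d.png", "bbb", 4000, 3000),
    ("Pair_1.jpg", "ccc", 640, 480),
    ("Pair_2.jpg", "ccc", 640, 480),
    ("Dot_1.gif", "ddd", 1, 1),
    ("Dot_2.gif", "ddd", 1, 1),
    ("Dot_3.gif", "ddd", 1, 1),
    ("Unique.jpg", "eee", 800, 600),
]

# One row per checksum: skip trivial images (2 pixels wide or less),
# keep checksums shared by at least N files, largest resolution first.
QUERY = """
SELECT MIN(img_name)               AS example_file,
       img_sha1,
       COUNT(*)                    AS copies,
       MAX(img_width * img_height) AS pixels
FROM image
WHERE img_width > 2
GROUP BY img_sha1
HAVING COUNT(*) >= ?
ORDER BY pixels DESC, example_file
"""


def find_duplicates(rows, min_copies=3):
    """Return (example_file, sha1, copies, pixels) for each duplicated checksum."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE image ("
            "img_name TEXT PRIMARY KEY, img_sha1 TEXT, "
            "img_width INTEGER, img_height INTEGER)"
        )
        conn.executemany("INSERT INTO image VALUES (?, ?, ?, ?)", rows)
        return conn.execute(QUERY, (min_copies,)).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for example_file, sha1, copies, pixels in find_duplicates(ROWS):
        print(f"{example_file}: {copies} identical files, {pixels} pixels")
```

Calling `find_duplicates(ROWS, min_copies=2)` gives the longer list of pairs as well. On a real replica the grouping has to cover every row of the table, so long run times like the one mentioned above are to be expected.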
