I am using the Wikimedia APIs to build a gallery of duplicate files and routinely clean them up. You can see the results here:
https://commons.wikimedia.org/wiki/User:Sreejithk2000/Duplicates

The page also has a link to the script. If anyone is interested in using this script, let me know and I can work with you to customize it.

- Sreejith K.

On Thu, Dec 4, 2014 at 2:46 PM, Fæ <[email protected]> wrote:

> On 4 December 2014 at 18:39, Jonas Öberg <[email protected]> wrote:
> >> * byte-for-byte identical
> >
> > That's something probably best done by WMF staff themselves, I think a
> > simple md5 comparison would give quite a few matches. Doing on the WMF
> > side would alleviate the need to transfer large amounts of data.
>
> Volunteers can do this using simple database queries, which is a lot
> more efficient than pulling data out of the API. For example while
> writing this email I knocked out a query to show all non-trivial
> images (>2 pixels wide) on Commons with at least *3* files having the
> same SHA1 checksum and showing each image just once. The matching
> files are listed at the bottom of each image page on Commons.
> Interestingly, this shows that most of the 226 files have been from an
> upload of Gospel illustrations. The low number seems reassuring
> considering the size of Commons. The files are reported in descending
> order by image resolution.
>
> Report:
> http://commons.wikimedia.org/w/index.php?title=User:F%C3%A6/sandbox&oldid=141460887
>
> On its own this is an interesting list to use as a backlog for fixes.
> Listing identical duplicates with 2 or more files matching would be
> simpler but longer; at the moment I count 3,279 files like this on
> Commons which took over 9 minutes to run. :-)
>
> Fae
> --
> [email protected] https://commons.wikimedia.org/wiki/User:Fae
>
> _______________________________________________
> Commons-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/commons-l
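For anyone who wants to experiment with the checksum approach described in the quoted message, here is a minimal sketch in Python. It is not the query Fae ran (that query is not shown in this thread). It assumes a small in-memory SQLite table whose column names (img_name, img_sha1, img_width, img_height) are modelled on the MediaWiki image table, and the function names find_duplicate_groups and make_demo_db are made up for illustration. It groups files by checksum, skips trivial images no wider than 2 pixels, keeps only checksums shared by at least 3 files, and sorts by resolution in descending order.

```python
import sqlite3


def find_duplicate_groups(conn, min_files=3, wider_than=2):
    """Return (sha1, sorted file names, pixel count) for every checksum
    shared by at least `min_files` files wider than `wider_than` pixels,
    largest resolution first."""
    rows = conn.execute(
        """
        SELECT img_sha1,
               GROUP_CONCAT(img_name, '|') AS names,
               MAX(img_width * img_height) AS pixels
        FROM image
        WHERE img_width > ?
        GROUP BY img_sha1
        HAVING COUNT(*) >= ?
        ORDER BY pixels DESC, img_sha1
        """,
        (wider_than, min_files),
    ).fetchall()
    # '|' is used as the separator because it cannot occur in a MediaWiki title.
    return [(sha1, sorted(names.split("|")), pixels) for sha1, names, pixels in rows]


def make_demo_db():
    """Build a toy database (hypothetical data, not real Commons files)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE image ("
        "img_name TEXT PRIMARY KEY, img_sha1 TEXT, "
        "img_width INTEGER, img_height INTEGER)"
    )
    conn.executemany(
        "INSERT INTO image VALUES (?, ?, ?, ?)",
        [
            ("A.jpg", "aaa", 800, 600),
            ("A_copy.jpg", "aaa", 800, 600),
            ("A_again.jpg", "aaa", 800, 600),
            ("B.jpg", "bbb", 1024, 768),
            ("B_copy.jpg", "bbb", 1024, 768),
            ("Dot1.png", "ccc", 1, 1),  # trivial 1x1 images, filtered out
            ("Dot2.png", "ccc", 1, 1),
            ("Dot3.png", "ccc", 1, 1),
        ],
    )
    return conn


if __name__ == "__main__":
    demo = make_demo_db()
    for sha1, names, pixels in find_duplicate_groups(demo):
        print(sha1, pixels, names)
```

Calling find_duplicate_groups(conn, min_files=2) instead gives the "2 or more files matching" variant mentioned in the quoted message, which returns a longer list.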
