I am using the Wikimedia APIs to build a gallery of duplicate files and
routinely clean them up. You can see the results here:

https://commons.wikimedia.org/wiki/User:Sreejithk2000/Duplicates

The page also has a link to the script. If anyone is interested in using
this script, let me know and I can work with you to customize it.
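
For anyone who wants a starting point before looking at the script, here is
a minimal sketch of one way to ask the API for duplicates. It is not the
script linked above. It assumes the MediaWiki `prop=duplicatefiles` query
module combined with the `allimages` generator; the function names and the
sample response in the usage note are made up for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint for Commons; other wikis use their own api.php URL.
API_URL = "https://commons.wikimedia.org/w/api.php"


def build_duplicates_query(limit=50):
    """Build a URL asking for a batch of files plus their identical copies."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "allimages",   # walk the file list
        "gailimit": limit,          # batch size for the generator
        "prop": "duplicatefiles",   # attach duplicates (matched by hash)
    }
    return API_URL + "?" + urlencode(params)


def collect_duplicates(response):
    """Map each file title to the names of its duplicates.

    `response` is the decoded JSON of one API reply. Files without
    duplicates are skipped.
    """
    pages = response.get("query", {}).get("pages", {})
    result = {}
    for page in pages.values():
        dupes = [d["name"] for d in page.get("duplicatefiles", [])]
        if dupes:
            result[page["title"]] = dupes
    return result
```

Fetching the URL, decoding the JSON and passing it to collect_duplicates()
gives one batch; a real run would also have to follow the API's "continue"
values to page through the rest of the files.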

- Sreejith K.


On Thu, Dec 4, 2014 at 2:46 PM, Fæ <[email protected]> wrote:

> On 4 December 2014 at 18:39, Jonas Öberg <[email protected]>
> wrote:
> >> * byte-for-byte identical
> >
> > That's probably best done by WMF staff themselves; I think a
> > simple md5 comparison would give quite a few matches. Doing it on the
> > WMF side would avoid the need to transfer large amounts of data.
>
> Volunteers can do this using simple database queries, which is a lot
> more efficient than pulling data out of the API. For example, while
> writing this email I knocked out a query to show all non-trivial
> images (>2 pixels wide) on Commons with at least *3* files having the
> same SHA1 checksum, showing each image just once. The matching
> files are listed at the bottom of each image page on Commons.
> Interestingly, this shows that most of the 226 files came from an
> upload of Gospel illustrations. The low number seems reassuring
> considering the size of Commons. The files are reported in descending
> order of image resolution.
>
> Report:
> http://commons.wikimedia.org/w/index.php?title=User:F%C3%A6/sandbox&oldid=141460887
>
> On its own this is an interesting list to use as a backlog for fixes.
> Listing identical duplicates with 2 or more files matching would be
> simpler but longer; at the moment I count 3,279 files like this on
> Commons, with a query that took over 9 minutes to run. :-)
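
The query itself is not quoted in this thread, so the following is only a
sketch of what a "same SHA1, at least N copies" query could look like. It
assumes column names in the style of MediaWiki's image table (img_name,
img_sha1, img_width, img_height) and runs against a small in-memory SQLite
table with made-up rows rather than the real Commons database.

```python
import sqlite3

# Group files by checksum, ignore trivial images (2 pixels wide or less),
# keep groups with at least N copies, largest resolution first.
QUERY = """
SELECT img_sha1, COUNT(*) AS copies, MAX(img_width * img_height) AS pixels
FROM image
WHERE img_width > 2
GROUP BY img_sha1
HAVING COUNT(*) >= ?
ORDER BY pixels DESC
"""


def find_duplicate_groups(conn, min_copies=3):
    """Return (sha1, copies, pixels) rows for checksums shared by enough files."""
    return conn.execute(QUERY, (min_copies,)).fetchall()


def demo_connection():
    """Create an in-memory table with hypothetical rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE image ("
        "img_name TEXT, img_sha1 TEXT, img_width INTEGER, img_height INTEGER)"
    )
    rows = [
        ("A.jpg", "aaa", 800, 600),
        ("B.jpg", "aaa", 800, 600),
        ("C.jpg", "aaa", 800, 600),
        ("D.jpg", "bbb", 1024, 768),
        ("E.jpg", "bbb", 1024, 768),
        ("F.gif", "ccc", 1, 1),   # trivial 1x1 images, filtered out
        ("G.gif", "ccc", 1, 1),
        ("H.gif", "ccc", 1, 1),
    ]
    conn.executemany("INSERT INTO image VALUES (?, ?, ?, ?)", rows)
    return conn
```

Calling find_duplicate_groups(demo_connection(), 3) corresponds to the
"at least 3 files" report, and passing 2 instead gives the longer
"2 or more" listing.
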
>
> Fae
> --
> [email protected] https://commons.wikimedia.org/wiki/User:Fae
>
> _______________________________________________
> Commons-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/commons-l
>