Re: [Commons-l] Duplicate removal?

Fæ Thu, 04 Dec 2014 11:47:07 -0800

On 4 December 2014 at 18:39, Jonas Öberg <[email protected]> wrote:
>> * byte-for-byte identical
>
> That's something probably best done by WMF staff themselves, I think a
> simple md5 comparison would give quite a few matches. Doing on the WMF
> side would alleviate the need to transfer large amounts of data.


Volunteers can do this using simple database queries, which is a lot
more efficient than pulling data out of the API. For example, while
writing this email I knocked out a query to show all non-trivial
images (>2 pixels wide) on Commons where at least *3* files share the
same SHA1 checksum, showing each image just once. The matching
files are listed at the bottom of each image page on Commons.
Interestingly, this shows that most of the 226 files come from an
upload of Gospel illustrations. The low number seems reassuring
considering the size of Commons. The files are reported in descending
order by image resolution.
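
A minimal sketch of that kind of query follows. It is not the query
actually run for the report. It assumes the column names of MediaWiki's
`image` table (img_name, img_sha1, img_width, img_height) and uses an
in-memory SQLite table with invented rows as a stand-in for the real
Commons database; the helper names build_demo_db and find_duplicates
are hypothetical.

```python
import sqlite3

# Invented sample rows; only the column names mirror MediaWiki's `image` table.
ROWS = [
    ("Gospel_a.jpg", "aaa", 800, 600),
    ("Gospel_b.jpg", "aaa", 800, 600),
    ("Gospel_c.jpg", "aaa", 800, 600),
    ("Map_1.png", "bbb", 4000, 3000),
    ("Map_2.png", "bbb", 4000, 3000),
    ("Map_3.png", "bbb", 4000, 3000),
    ("Map_4.png", "bbb", 4000, 3000),
    ("Pair_1.jpg", "ccc", 640, 480),
    ("Pair_2.jpg", "ccc", 640, 480),
    ("Dot_1.gif", "ddd", 1, 1),   # trivial 1x1 images, filtered out below
    ("Dot_2.gif", "ddd", 1, 1),
    ("Dot_3.gif", "ddd", 1, 1),
    ("Unique.jpg", "eee", 1024, 768),
]

# One row per checksum: an example file name, the number of identical
# files, and the pixel count used for sorting (largest first).
QUERY = """
SELECT MIN(img_name)               AS example_file,
       COUNT(*)                    AS copies,
       MAX(img_width * img_height) AS pixels
FROM image
WHERE img_width > 2
GROUP BY img_sha1
HAVING COUNT(*) >= ?
ORDER BY pixels DESC
"""


def build_demo_db():
    """Create an in-memory database holding the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE image ("
        "img_name TEXT, img_sha1 TEXT, img_width INTEGER, img_height INTEGER)"
    )
    conn.executemany("INSERT INTO image VALUES (?, ?, ?, ?)", ROWS)
    return conn


def find_duplicates(conn, min_copies=3):
    """Return (example_file, copies, pixels) for each duplicated checksum."""
    return conn.execute(QUERY, (min_copies,)).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for example_file, copies, pixels in find_duplicates(demo, min_copies=3):
        print(example_file, copies, pixels)
```

Lowering min_copies to 2 gives the longer "2 or more files matching"
variant of the same list.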

Report: 
http://commons.wikimedia.org/w/index.php?title=User:F%C3%A6/sandbox&oldid=141460887

On its own this is an interesting list to use as a backlog for fixes.
Listing identical duplicates with 2 or more files matching would be
simpler but longer; at the moment I count 3,279 files like this on
Commons, using a query that took over 9 minutes to run. :-)

Fae
-- 
[email protected] https://commons.wikimedia.org/wiki/User:Fae

_______________________________________________
Commons-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/commons-l
