On 04.12.2014 19:39, Jonas Öberg wrote:
> Hi James,
>
>> * byte-for-byte identical
>
> That's something probably best done by WMF staff themselves, I think a
> simple md5 comparison would give quite a few matches. Doing on the WMF
> side would alleviate the need to transfer large amounts of data.

This already happens automatically: the SHA1 hash of every file is computed on upload and stored in the img_sha1 field of the image table in the database. I believe this is used to warn users who try to upload an exact duplicate, but I'm not sure about that.

Anyway, *exact* duplicates can easily be found in the database by anyone who has an account on toollabs. The relevant query is:

  select A.img_name, A.img_sha1, B.img_name
  from image as A
  join image as B
    on A.img_sha1 = B.img_sha1
   and A.img_name < B.img_name;

Having a list of "effective" duplicates, such as the same image at a slightly different resolution or compression level, would of course be very interesting.

-- daniel
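
To illustrate how the self-join on img_sha1 behaves, here is a minimal sketch that runs the same query against a toy table. Everything around the query is an assumption made for the example: the in-memory SQLite database stands in for the real one, the file names and hash strings are placeholders rather than real SHA1 values, the function name find_exact_duplicates is hypothetical, and the ORDER BY clause is added only to make the output order predictable.

```python
import sqlite3

# Hypothetical toy data: (img_name, img_sha1). The hash strings are
# placeholders for illustration, not real SHA1 values.
ROWS = [
    ("Cat.jpg", "aaa"),
    ("Cat_copy.jpg", "aaa"),
    ("Dog.png", "bbb"),
    ("Kitten.jpg", "aaa"),
]

# Same join as in the mail; ORDER BY is added here so the result order
# is deterministic.
DUPLICATE_QUERY = """
    SELECT A.img_name, A.img_sha1, B.img_name
    FROM image AS A
    JOIN image AS B
      ON A.img_sha1 = B.img_sha1
     AND A.img_name < B.img_name
    ORDER BY A.img_name, B.img_name
"""


def find_exact_duplicates(rows):
    """Load rows into an in-memory table and return duplicate name pairs."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE image (img_name TEXT PRIMARY KEY, img_sha1 TEXT)"
        )
        conn.executemany("INSERT INTO image VALUES (?, ?)", rows)
        return conn.execute(DUPLICATE_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for name_a, sha1, name_b in find_exact_duplicates(ROWS):
        print(name_a, sha1, name_b)
```

The condition A.img_name < B.img_name keeps a file from being paired with itself and reports each pair only once. Note that the output is a list of pairs, so three files with the same hash show up as three rows.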
