Just casually clicking through some of the results you pulled, I can see at a glance that many of these duplicate uploads happen because the oldest version is virtually "unfindable" for Wikimedians; i.e., it is not in any category whatsoever.
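To illustrate, here is a rough Python sketch (untested, and the file title is just a placeholder) that uses the public MediaWiki API (action=query, prop=categories) to flag files sitting in no visible category at all:

#!/usr/bin/env python3
# Rough sketch, not from the thread: flag Commons files that belong to no
# visible category, via the public MediaWiki API. Assumes the 'requests'
# library is installed; the example title below is hypothetical.
import requests

API = "https://commons.wikimedia.org/w/api.php"

def uncategorized(titles):
    """Yield the titles (e.g. 'File:Example.jpg') with no visible category."""
    r = requests.get(API, params={
        "action": "query",
        "prop": "categories",
        "clshow": "!hidden",   # ignore hidden maintenance categories
        "cllimit": "max",
        "titles": "|".join(titles),
        "format": "json",
    })
    r.raise_for_status()
    for page in r.json()["query"]["pages"].values():
        # Pages with no matching categories simply lack the 'categories' key.
        if "categories" not in page:
            yield page["title"]

if __name__ == "__main__":
    for title in uncategorized(["File:Example.jpg"]):
        print(title, "is in no category")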
On Thu, Dec 4, 2014 at 7:39 PM, Jonas Öberg <[email protected]> wrote:
> Hi James,
>
> > * byte-for-byte identical
>
> That's something probably best done by WMF staff themselves; I think a
> simple md5 comparison would give quite a few matches. Doing it on the
> WMF side would alleviate the need to transfer large amounts of data.
>
> For the rest, that's something that requires only a few API lookups to
> get the relevant information (size etc.). I can also imagine that it
> might be useful to take the results we've gotten and apply some
> secondary matching to the pairs that we've identified. Such a secondary
> matching could be more specific than ours, to narrow down to true
> duplicates, and also take size into consideration.
>
> That's beyond our needs, though: we're happy with the information we
> have, and while it would contribute to our work to eliminate
> duplicates on Commons, it's not critical right now. But if someone is
> interested in working with our results or our data, we'd be happy to
> collaborate on that if it would benefit Commons.
>
> Sincerely,
> Jonas
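For what it's worth, here is a rough sketch of the hash comparison Jonas describes. MediaWiki already stores a SHA-1 for every file and serves it through the API (prop=imageinfo, iiprop=sha1|size), so byte-for-byte duplicates can be grouped without transferring any image data, which is exactly the saving he points to; the md5 in his mail stands in for "any content hash". Untested, and the helper names are mine:

#!/usr/bin/env python3
# Rough sketch, not from the thread: group byte-for-byte identical files
# by the hash and size MediaWiki already stores, queried over the API.
from collections import defaultdict
import requests

API = "https://commons.wikimedia.org/w/api.php"

def hash_and_size(titles):
    """Return {title: (sha1, size)} without downloading any image data.
    Note: the API accepts at most 50 titles per request."""
    r = requests.get(API, params={
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "sha1|size",
        "titles": "|".join(titles),
        "format": "json",
    })
    r.raise_for_status()
    out = {}
    for page in r.json()["query"]["pages"].values():
        info = page.get("imageinfo")   # absent for missing/deleted pages
        if info:
            out[page["title"]] = (info[0]["sha1"], info[0]["size"])
    return out

def duplicate_groups(titles):
    """Group titles whose stored hash and byte size match exactly."""
    groups = defaultdict(list)
    for title, key in hash_and_size(titles).items():
        groups[key].append(title)
    return [g for g in groups.values() if len(g) > 1]

For a single known file, prop=duplicatefiles asks the server directly for byte-identical copies, which keeps the whole comparison on the WMF side.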
_______________________________________________
Commons-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/commons-l
