Just casually clicking through some of the results you pulled, I can see at a
glance that many of these duplicate uploads occur because the oldest
version is virtually "unfindable" for Wikimedians; i.e. it is not in any
category whatsoever.
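
(A sketch, not from the thread: the byte-for-byte check discussed below
boils down to grouping files by MD5 digest. The filenames and byte strings
here are made up for illustration; a real run would hash actual file
contents with hashlib rather than in-memory stand-ins.)

```python
import hashlib
from collections import defaultdict

def md5_of(data: bytes) -> str:
    """Hex MD5 digest of a byte string (stand-in for a file's contents)."""
    return hashlib.md5(data).hexdigest()

def find_duplicates(files: dict) -> list:
    """Group names whose contents hash identically; keep groups of 2+."""
    by_hash = defaultdict(list)
    for name, data in files.items():
        by_hash[md5_of(data)].append(name)
    return [sorted(names) for names in by_hash.values() if len(names) > 1]

# Hypothetical sample data: two byte-identical "files" and one distinct one.
files = {
    "Example.jpg": b"\xff\xd8\xff...",
    "Example_copy.jpg": b"\xff\xd8\xff...",
    "Other.png": b"\x89PNG...",
}
print(find_duplicates(files))  # prints [['Example.jpg', 'Example_copy.jpg']]
```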

On Thu, Dec 4, 2014 at 7:39 PM, Jonas Öberg <[email protected]>
wrote:

> Hi James,
>
> > * byte-for-byte identical
>
> That's probably best done by WMF staff themselves; I think a
> simple MD5 comparison would yield quite a few matches. Doing it on the
> WMF side would avoid the need to transfer large amounts of data.
>
> For the rest, that's something that requires only a few API lookups to
> get the relevant information (size, etc.). I can also imagine that it
> might be useful to take the results we've gotten and apply some secondary
> matching to the pairs we've identified. Such a secondary matching
> could be more specific than ours, narrowing the pairs down to true
> duplicates, and could also take size into consideration.
>
> That's beyond our needs, though: we're happy with the information we
> have, and while it would contribute to our work to eliminate
> duplicates in Commons, it's not critical right now. But if someone is
> interested in working with our results or our data, we'd be happy to
> collaborate around that if it would benefit Commons.
>
> Sincerely,
> Jonas
>
> _______________________________________________
> Commons-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/commons-l
>