Re: [Commons-l] Hashing Wikimedia Commons

Jean-Frédéric Thu, 04 Sep 2014 03:32:09 -0700

Hi Jonas,

Awesome project!


I’m cc-ing the WMF Multimedia team, who might have some more answers :)


2014-09-04 12:26 GMT+02:00 Jonas Öberg <[email protected]>:

> Dear all,
>
> some of you may have been at our presentation during Wikimania and you'll
> find this familiar, but for the rest of you, I'm working with Commons
> Machinery on software that aims to identify images on the web, even
> when they are used outside of their original context, to provide automatic
> attribution and a referral back to their origin. Imagine a blogger using a
> photo from Commons: when you visit that blog, a browser plugin overlays
> a small icon showing that the image is from Commons and inviting you to
> find out more, even if the blogger forgot to attribute it.
>
> We're currently working on an addon for Firefox to do just this, and we've
> previously worked out a backend to store the information we need to make
> these matches, some utilities for perceptual image hashing etc. We would
> love to work with images from Wikimedia Commons as a first dataset to
> explore how this will all work in practice.
>
> But in order to do so, we need information from Commons, and we want to
> make this as easy on the WMF servers as possible, so we'd appreciate some
> help and pointers. What we're looking at retrieving is information about
> (1) title, (2) author, (3) license, and (4) thumbnails of medium size.
>
> The first three we can get from pretty much any of the APIs, or extract
> directly from a dump file. The thumbnails are eluding us though, for two
> reasons. One is that a file, like 30C3_Commons_Machinery_2.jpg, is actually
> in the /b/ba/ directory - but where this /b/ba/ comes from (a hash?) is
> unclear to us now, and it's not something we find in the dumps - though we
> can get it from one of the APIs.
>
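
On the /b/ba/ question: in MediaWiki's hashed upload layout, that directory comes from the MD5 of the file name (spaces written as underscores): the first hex digit, then the first two hex digits. A minimal sketch in Python, assuming the name is already in its canonical form (first letter capitalised); commons_hash_dir is a made-up helper name, not part of any API:

```python
import hashlib


def commons_hash_dir(filename: str) -> str:
    """Return the 'x/xy' shard directory for a Commons file name.

    Assumption: the name is already canonical (first letter capitalised).
    """
    # MediaWiki stores file names with underscores instead of spaces.
    name = filename.replace(" ", "_")
    # Hash the UTF-8 bytes of the name and take the leading hex digits.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{digest[0]}/{digest[:2]}"


# The file from the mail above, which is reported to live under /b/ba/.
print(commons_hash_dir("30C3_Commons_Machinery_2.jpg"))
```

Since the directory only depends on the file name, it can be computed offline from the titles in a dump, without asking the servers.
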
> The other is thumbnail sizes. We need to retrieve a reasonably sized image,
> about 640px wide (in many cases smaller than the original), so that we can
> then run a perceptual hash algorithm on this file.
>
> From what we can understand, you can request any size thumbnail on an
> image simply by prefixing it with the size you want (like
> 123x-Filename.jpg). But it seems really silly to always request 640x for
> instance, since that would mean the WMF servers would need to generate that
> for us specifically if the resolution doesn't exist.
>
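
For reference, thumbnail URLs on upload.wikimedia.org put the width in front of the file name as "<width>px-" (so "640px-Filename.jpg" rather than "640x-Filename.jpg"), under a thumb/ path that repeats the hash directory. A rough sketch of building such a URL in Python; commons_thumb_url and UPLOAD_BASE are illustrative names, and the sketch only covers plain raster files such as JPEG (SVG and some other formats get a different thumbnail file name, which is ignored here):

```python
import hashlib
from urllib.parse import quote

# Base of the Commons upload server (files and thumbnails live below it).
UPLOAD_BASE = "https://upload.wikimedia.org/wikipedia/commons"


def commons_thumb_url(filename: str, width: int = 640) -> str:
    """Build the thumbnail URL for a raster file on Commons.

    Assumptions: canonical file name, raster format, and a width that is
    smaller than the original image.
    """
    name = filename.replace(" ", "_")
    # Same MD5-based shard directory as for the original file.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # Percent-encode the name so non-ASCII titles survive in a URL.
    encoded = quote(name)
    return (
        f"{UPLOAD_BASE}/thumb/{digest[0]}/{digest[:2]}/"
        f"{encoded}/{width}px-{encoded}"
    )


print(commons_thumb_url("30C3_Commons_Machinery_2.jpg", 640))
```

Building the URL does not tell you whether that size has already been rendered, so the concern about making the servers scale images on demand still stands.
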
> What we'd find much more appealing is to be able to determine before
> making the call what sizes already exist and which can be retrieved without
> the WMF servers needing to rescale them for us. And while the viewer on
> Commons does seem to offer thumbnails in various sizes, we can't seem to get
> that information from any API.
>
> We can scrape the Commons web page for this information, but we figured
> that people here might have good ideas for how we could approach this with
> minimal impact on the WMF servers :)
>
> Sincerely,
> Jonas
>
>
> _______________________________________________
> Commons-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/commons-l
>
>


-- 
Jean-Frédéric

