Hi Jonas,

Awesome project!

I’m cc-ing the WMF Multimedia team, who might have some more answers :)

2014-09-04 12:26 GMT+02:00 Jonas Öberg <[email protected]>:

> Dear all,
>
> some of you may have been at our presentation during Wikimania and you'll
> find this familiar, but for the rest of you, I'm working with Commons
> Machinery on software that we hope will identify images on the web, even
> when they are used outside of their original context, to provide automatic
> attribution and a referral back to their origin. Imagine a blogger using a
> photo from Commons, visiting that blog and having a browser plugin overlay
> a small icon showing that the image is from Commons and inviting you to
> find out more - even if the blogger forgot to attribute.
>
> We're currently working on an addon for Firefox to do just this, and we've
> previously worked out a backend to store the information we need to make
> these matches, some utilities for perceptual image hashing etc. We would
> love to work with images from Wikimedia Commons as a first dataset to
> explore how this will all work in practice.
>
> But in order to do so, we need information from Commons, and we want to
> make this as easy on the WMF servers as possible, so we'd appreciate some
> help and pointers. What we're looking at retrieving is information about
> (1) title, (2) author, (3) license, and (4) thumbnails of medium size.
>
> The first three we can get from pretty much either API, or extract
> directly from a dump file. The latter is eluding us though, for two
> reasons. One is that a file, like 30C3_Commons_Machinery_2.jpg, is actually
> in the /b/ba/ directory - but where this /b/ba/ comes from (a hash?) is
> unclear to us now, and it's not something we find in the dumps - though we
> can get it from one of the APIs.
>
> The other is thumbnail sizes. We need to retrieve a reasonably sized image
> (but in many cases less than the original size) of about 640px wide, so
> that we can then run a perceptual hash algorithm on this file.
>
> From what we can understand, you can request any size thumbnail on an
> image simply by prefixing it with the size you want (like
> 123x-Filename.jpg). But it seems really silly to always request 640x for
> instance, since that would mean the WMF servers would need to generate that
> for us specifically if the resolution doesn't exist.
>
> What we'd find much more appealing is to be able to determine before
> making the call what sizes already exist and which can be retrieved without
> the WMF servers needing to rescale them for us. And while the viewer on
> Commons does seem to offer thumbnails in various sizes, we can't seem to
> get that information from any API.
>
> We can scrape the Commons web page for this information, but we figured
> that people here might have good ideas for how we approach this with
> minimal impact on the WMF servers :)
>
> Sincerely,
> Jonas
>
> _______________________________________________
> Commons-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/commons-l

--
Jean-Frédéric
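On the /b/ba/ question in the quoted message: as far as I know, MediaWiki's hashed upload directories are derived from the MD5 of the file name (with spaces written as underscores), using the first hex digit and then the first two. Below is a minimal sketch of that, plus the thumbnail URL and an API query built from it. The helper names (hashed_dir, thumb_url, imageinfo_query) are made up for illustration, the "640px-" prefix is the form I believe the thumbnail URLs use (rather than "640x-"), and the file name is assumed to already be in its canonical form. Nothing here says which sizes are already cached on the servers.

```python
import hashlib
from urllib.parse import quote, urlencode

# Assumed base URLs for Wikimedia Commons.
UPLOAD_BASE = "https://upload.wikimedia.org/wikipedia/commons"
API_BASE = "https://commons.wikimedia.org/w/api.php"


def hashed_dir(filename: str) -> str:
    """Return the 'x/xy' directory derived from the MD5 of the file name."""
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{digest[0]}/{digest[:2]}"


def thumb_url(filename: str, width: int) -> str:
    """Build a thumbnail URL of the given pixel width (hypothetical helper)."""
    name = filename.replace(" ", "_")
    encoded = quote(name)
    return f"{UPLOAD_BASE}/thumb/{hashed_dir(name)}/{encoded}/{width}px-{encoded}"


def imageinfo_query(filename: str, width: int) -> str:
    """Build an imageinfo API query that asks for a scaled URL of this width."""
    params = {
        "action": "query",
        "titles": f"File:{filename}",
        "prop": "imageinfo",
        "iiprop": "url",
        "iiurlwidth": width,
        "format": "json",
    }
    return f"{API_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    print(hashed_dir("30C3_Commons_Machinery_2.jpg"))
    print(thumb_url("30C3_Commons_Machinery_2.jpg", 640))
    print(imageinfo_query("30C3_Commons_Machinery_2.jpg", 640))
```

Computing the directory locally avoids one API round trip per file when working from a dump; the imageinfo query is the alternative when you would rather let the API return the thumbnail URL itself.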
