[Glamtools] Thinking about Structured Data on Commons ?

James Heald Fri, 15 Aug 2014 09:50:28 -0700

So now Wikimania has been and gone, can we think about where we're at with the Structured Data initiative for Commons?

In particular Liam, I think you said after one of the sessions you were "all over this" -- it would be good to know your thoughts.

I do think that, as the GW toolset community, we ought to have a lot to offer here, because essentially we are doing big uploads from data which is *already* structured, so

(i) we've already got at least some experience of working with data that is in some form structured

(ii) we may know and be able to flag some awkward edge cases

(iii) we would like to accompany uploads with data that can be "born structured", rather than converted later

(iv) in any case we're uploading a lot of images, which somebody is going to have to convert to structured data

(v) we may have seen (or even written) some of the gnarlier templates on Commons, that migration will have to cope with.

It's not clear (at least not yet to me) how the Multimedia and Wikidata teams may best want to be communicated with, but I'm including Keegan (WMF) in cc:, who I think is the staffer with assigned community liaison responsibility.

The biggest message to me from looking through some of the documents after the meetings is just how much of the information is going to be stored as part of central main Wikidata.

Essentially, if we upload an image of an object, then it is expected that an 'item' (ie a Q-number) for that object will be added to Wikidata, which will contain all the metadata that describes the object rather than just the image.

The Wikidata community is already developing a very strong ontology to describe such objects -- key resources are


https://www.wikidata.org/wiki/Wikidata:WikiProject_Visual_arts/Item_structure
https://www.wikidata.org/wiki/Wikidata:WikiProject_Books

where there are active and friendly communities involved in refining them.

We can get involved and help the process right now, by trying to identify and fill any gaps in these ontologies, and by being enthusiastic early adopters -- there is no reason we should not be filling in appropriate metadata on Wikidata each time we upload an image to Commons -- real-world testing the current ontologies to see what creaks.

Data specific to the image itself (rather than what it shows) will be stored in a separate Commons Wikibase.

This will include such things as the file name, a file description, photographer, wikicontributor name, precise geographical location etc.

Commons Wikibase is also likely to contain a tag-like "topic list" -- a list of all the Wikidata Q-numbers that apply to the image. These I think will be gathered by climbing up the Wikidata tree from any specified Subject identified for the image -- so a view of Westminster Abbey might get topics such as "Westminster; London; England; Cathedral; religious building" etc; and games will be invented to encourage people to identify more such topics for the best images.
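A rough sketch of what "climbing up the tree" could look like. The graph below is a made-up toy (plain labels standing in for Q-numbers, not real Wikidata content), and `gather_topics` is a hypothetical helper, not anything the Multimedia or Wikidata teams have specified:

```python
from collections import deque

# Hypothetical toy graph: each topic maps to its broader topics.
# Labels stand in for Q-numbers; none of this is real Wikidata content.
BROADER = {
    "Westminster Abbey": ["Westminster", "Cathedral"],
    "Westminster": ["London"],
    "London": ["England"],
    "Cathedral": ["religious building"],
}


def gather_topics(subject, broader=BROADER):
    """Return every topic reachable by climbing up from the subject."""
    seen = []
    queue = deque([subject])
    while queue:
        current = queue.popleft()
        for parent in broader.get(current, []):
            if parent not in seen:  # guards against loops and duplicates
                seen.append(parent)
                queue.append(parent)
    return seen


topics = gather_topics("Westminster Abbey")
```

The walk is breadth-first, so nearer topics come before more general ones; whether the real topic list would be ordered at all is an open question.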

This should allow WM to introduce a proper combinatorial search engine based on tags for Commons; and many of the most egregious Commons intersection categories will wither on the vine. (There is debate as to whether Commons will end up needing *any* category pages, but I suspect it will, because they are just so convenient to use as places for jotting down facts -- on the other hand, it is possible one might be forced to create an associated Commons article/gallery for that).
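To illustrate why tags could replace intersection categories, here is a minimal sketch of a tag-intersection search. The file names and topic sets are invented for the example, and the functions are hypothetical; a real search engine would of course work differently at scale:

```python
# Hypothetical file pages and their topic tags (made up for illustration).
FILE_TOPICS = {
    "File:Abbey_west_front.jpg": {"Westminster", "London", "England", "religious building"},
    "File:Tower_Bridge.jpg": {"London", "England", "bridge"},
    "File:York_Minster.jpg": {"York", "England", "religious building"},
}


def build_index(file_topics):
    """Invert file -> topics into topic -> set of files."""
    index = {}
    for filename, topics in file_topics.items():
        for topic in topics:
            index.setdefault(topic, set()).add(filename)
    return index


def search(index, *wanted):
    """Return the files tagged with every one of the wanted topics."""
    if not wanted:
        return set()
    result = set(index.get(wanted[0], set()))
    for topic in wanted[1:]:
        result &= index.get(topic, set())
    return result


INDEX = build_index(FILE_TOPICS)
london_religious = search(INDEX, "London", "religious building")
```

Any combination of topics can be queried this way without anyone having to create a "Religious buildings in London" category in advance.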

It would be nice (IMO) if there could be an interface to the topic list through the wikisource code for the filepage -- I think this would be well-received by the community, allow easy adaptation of existing bots, etc. But this may be resisted as being too fragile a point of failure, as it would mean that people making hand-edits would have to know (and get right) the meaningless number-strings of individual Q-numbers.

Finally, some very specific text data -- such as the EXIF data describing shutter-speed etc -- is likely to continue to live on the file description page, because it's probably not something that people are primarily going to want to search, and it may be a bit unpredictable.

Part of the immediate effort in the next few weeks is going to be to produce clearer ideas about what information is going to live where, and in particular what information is going to live on the Commons Wikibase, and how it will be structured.

The good news is that much of the most complicated information will be stored on Wikidata, so can be as detailed as we like (and can be accessed live now).

On the other hand, the design for Commons Wikibase will initially aim to be as simple as possible, with the aim of evolving it as experience is gained and migrating the edge cases later.

The file description page (or something not entirely unlike it) will continue to exist as a view bringing together all the data.

Current templates will be re-written to draw information from Wikidata. However, this won't be possible until the Wikidata team has implemented the "Arbitrary Access" feature -- the ability for a wikipage to access the properties of an arbitrary Wikidata Q-number. What's causing the hold-up is that if the properties of the Q-number item are edited, then all the pages that access that Q-number need to be marked as dirty and regenerated. That's easy if you only have one page that can access the Q-number, but hard if arbitrary pages can access it, through a chain of properties.

(eg: the file page for a painting Q12345 may use property Pnnn to reach its creator Q4567, who has property Pxxx, a date of birth. If the date of birth gets made more precise, the system has to recurse back to indicate that all the file pages showing pictures of that creator's work need to be regenerated. This is tough, but file-page templates won't be able to draw on Wikidata information until it is in place).
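The bookkeeping involved can be sketched as a reverse index from items to the pages that read them. This is only an illustration of the problem, not how the Wikidata team implements it; `DependencyTracker` and the file names are hypothetical, and Q12345 / Q4567 are the placeholder numbers from the example above:

```python
from collections import defaultdict


class DependencyTracker:
    """Remember which pages read which items, so edits can mark pages dirty."""

    def __init__(self):
        self.pages_by_item = defaultdict(set)
        self.dirty = set()

    def record_access(self, page, item):
        # Called whenever rendering `page` reads any property of `item`,
        # including items only reached through a chain of properties.
        self.pages_by_item[item].add(page)

    def item_edited(self, item):
        # Every page that read the item must be regenerated.
        affected = self.pages_by_item[item]
        self.dirty |= affected
        return sorted(affected)


tracker = DependencyTracker()
# The file page reads the painting Q12345, then follows its creator to Q4567.
tracker.record_access("File:Painting.jpg", "Q12345")
tracker.record_access("File:Painting.jpg", "Q4567")
tracker.record_access("File:Another_work.jpg", "Q4567")

# The creator's date of birth is made more precise.
stale = tracker.item_edited("Q4567")
```

The hard part in practice is that the set of recorded accesses grows with every page and every hop along a property chain, which is why the single-page case is easy and the arbitrary case is not.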

It is hoped to simplify the myriad different templates used on the file pages as quickly as possible, progressively standardising them to draw from the structured data stores.

Templates to display summary information about collection objects, which will draw from Wikidata, may well be standardised so they can easily be used on Wikipedias and other wikis -- or, to put that the other way round, since Wikipedias and other wikis will also be developing standardised templates to display summary object information, it should well be possible to use the same code twice.

However, it would be good to get involved in the development of these templates, to make sure they accurately reflect the information we currently like to show in Commons.

(There may be some important details to get right -- for example the Wikidata data-type for dates currently comprises a 'best' value, and an optional numeric range (which is great for sorting). But if the catalogue source data says eg "mid 17th century to early 18th century", do we want to make sure that precise string is still stored? And should it still be possible to make it visible? This needs close engagement; but probably principally with the community-based development effort in the Wikidata community groups.)
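One way to picture keeping both forms side by side is sketched below. `CatalogueDate` and its field names are hypothetical (this is not the actual Wikidata date data-type), and the years chosen for "mid 17th century to early 18th century" are an assumed reading for illustration only:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CatalogueDate:
    best_year: int                       # single 'best' value, used for sorting
    earliest_year: Optional[int] = None  # optional numeric range
    latest_year: Optional[int] = None
    source_text: Optional[str] = None    # the catalogue's own wording, if kept

    def display(self):
        """Prefer the catalogue's wording, then the range, then the best value."""
        if self.source_text:
            return self.source_text
        if self.earliest_year is not None and self.latest_year is not None:
            return f"{self.earliest_year}-{self.latest_year}"
        return str(self.best_year)

    def sort_key(self):
        return self.best_year


# The numeric values here are an assumed interpretation, not catalogue data.
d = CatalogueDate(
    best_year=1680,
    earliest_year=1640,
    latest_year=1715,
    source_text="mid 17th century to early 18th century",
)
shown = d.display()
```

Sorting still works on the numeric value, while the display falls back to the numbers only when no source string was stored.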

Already very standardised are the present Commons creator templates and Commons institution templates. These are likely to be an early quick win.

Looking down a typical present-day filepage, that means that it is the Source/Photographer information in the present "Artist" template, which is currently free-form and often a freely composed pulling-together of multiple different sources of metadata, that is likely to need the most work to unpick.

This is also the field most commonly used for the credit link-back templates to the originating GLAM institutions, which are obviously a key consideration for our GLAM partners.

These templates may currently often be very institution-specific, and may do quite complex stuff -- eg the present version of the British Library template

https://commons.wikimedia.org/wiki/Template:British_Library_image
as used at eg
https://commons.wikimedia.org/wiki/File:Cuthbert_discovers_piece_of_timber_-_Life_of_St._Cuthbert_%28late_12th_C%29,_f.45v_-_BL_Yates_Thompson_MS_26.jpg

contains link-backs to a number of catalogues, each with their own corresponding text; and as well as linking back to the information about the underlying object (which is likely to be stored on Wikidata), it will also likely contain a link-back to the source of the original file (in this case the specific file at BL images online), which, being information specific to the file, is likely to live on the Commons Wikibase.

The Source/Photographer field as a whole is (I think) likely to be one of the last on the file page to be assimilated, because it can be so sui-generis, and so the present rats nest of templates may continue to be acceptable for some time -- though even they are likely to need modification, as eg Photographer information moves to the Commons Wikibase.

That said, each institution is only going to need to manage its own template.


But it probably would make sense to start an effort to think about
* what is the structured data that typically lives in these templates?
* and is there some standardisation we could start to get into the box, even now?

Apart from anything else, something readily customisable might be much easier for new institutions to adapt and adopt.

For the migration project as a whole, an audit of all the source templates of this sort would be useful. That is something the MM/WD project team could perhaps usefully encourage the community to undertake for them.

I have to admit there are lines I am not sure about, as to what gets a Wikidata entry and what does not.


For example, when does a photograph deserve its own entry?

Perhaps a bright line is that a photograph one took oneself doesn't get an entry on Wikidata, but a photograph by Man Ray perhaps does.

What about a photograph by a photographer of more intermediate notability? Or instead, perhaps an engraving from a book of 19th century engravings?

It makes sense to create an identifier for the book on Wikidata; and also for the place depicted. This is often almost enough to identify the particular image, but really one would want to store the page number, and perhaps the scan number as well (since one might well have either one or the other or both). It would probably be good to store some identifier for the set of scans as well -- this too probably doesn't belong on Wikidata (although one might identify it as set number <identifier> from eg the Mechanical Curator collection, which itself probably then *would* get a Wikidata identifier).

So the Commons Wikibase probably needs to be able to identify images as having a sequence in a particular set, and that set as perhaps having an identifier that links it to a collection which has a particular Q-number on Wikidata.
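As a sketch of the shape such a record might take (all class and field names here are hypothetical, and "set-001" and "Q000" are placeholders rather than real identifiers):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScanSet:
    set_id: str                    # identifier of the set within its collection
    collection_qid: Optional[str]  # Q-number of the collection, if it has one


@dataclass(frozen=True)
class ImageRecord:
    filename: str
    scan_set: ScanSet
    sequence: Optional[int] = None  # scan number within the set
    page: Optional[int] = None      # page number in the book


def in_order(records):
    """Sort records by set, then by sequence number; unnumbered scans go last."""
    return sorted(
        records,
        key=lambda r: (r.scan_set.set_id, r.sequence is None, r.sequence or 0),
    )


# Placeholder identifiers, for illustration only.
scans = ScanSet(set_id="set-001", collection_qid="Q000")
records = [
    ImageRecord("File:Engraving_c.jpg", scans, page=12),
    ImageRecord("File:Engraving_b.jpg", scans, sequence=2),
    ImageRecord("File:Engraving_a.jpg", scans, sequence=1, page=3),
]
ordered = [r.filename for r in in_order(records)]
```

Both `sequence` and `page` are optional here because, as noted above, one might well have either one or the other or both.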

This is the kind of thinking we will particularly need to be doing over the next few weeks -- what is the metadata that will *not* be stored on Wikidata, so will *need* to be stored on the Commons Wikibase if it is to continue to be accessible? That is something that we as the community need to evolve, thinking of all the use cases we can.

I am sure that there was something else I meant to say, but this email seems long enough already.

There's a scratchpad of some bookmarks I started keeping on a subpage of my userpage at Wikidata that people are welcome to,


https://www.wikidata.org/wiki/User:Jheald/bookmarks

This gives a nutshell of where some different fields might be stored

https://docs.google.com/presentation/d/1x-vOUr-zveLzoIP6uJC1Sz95xwuTFNwaBqJqmtWH8Qk/edit#slide=id.g3704ec6dd_2_554

This etherpad is good, esp lines immediately after 140, and "What new fields should be created to complement the old fields?" at 156 (actually in the context of Upload Wizard, but it gives some ideas)


http://etherpad.wikimedia.org/p/multimedia-wikidata-catchup

There's a spreadsheet showing some of the fields they're thinking about

https://docs.google.com/spreadsheets/d/1rk05EcLZpJaqOh5wymK6teIQufH9t0xn6oDPeyJHap0/edit#gid=0

-- though I think quite a lot of what's down as living on Wikidata should really be Commons Wikibase --


and also a suggestion based on some simple use-cases:

https://docs.google.com/document/d/1C7UTB1kbaf_EisF3LmhpIQkb_ifSkB0rD8IGI9aMQhM/edit

though I think we would probably see that as *too* simple, even for a first build, because for many of our applications


    sequence-number in set
&   set-identifier in collection

are probably essential quantities to have (as they probably are for the Wikisource collection too).

Finally, this is an etherpad from the Hackathon just gone, which has a lot of useful links at the end.

http://etherpad.wikimedia.org/p/structured-data-discussion-7-august-2014

Hope this initial brain dump is of at least some use, to make it worth its length,


All best,

   James Heald.   (User:Jheald).





