On Mon, Feb 20, 2012 at 5:43 AM, Ben Companjen <[email protected]> wrote:
> I played a little with the datadump of January and found out:
> - there are 47796615 records (author, work, edition) in the dump
> - distributors is empty in every edition record
> - there are 7439623 records that have content in the contributions field
> - there are 14996 records with content in the contributors field,
>   which is an array of (role, name) tuples; there are just under 300
>   values for 'role' - that could use some cleanup.
> - there are 38 records containing a link to VIAF
> - there are 1294 records containing a link to the English Wikipedia
> - there are 1864 records with content in the wikipedia field, which
>   has links to non-English Wikipedias too.
>
> I published a blog post about this at
> http://ben.companjen.name/2012/02/playing-with-the-open-library/

Thanks for posting that. I don't suppose you have counts for all the
JSON keys that appear, do you? :-)

I'm surprised both by the rarity of the external links and by the fact
that they're encoded as just raw URLs rather than being treated as
identifiers.

Focusing on the authors, I count 6,757,452 author records, so the VIAF
& Wikipedia links represent a vanishingly small fraction. I took a look
at Freebase to see what kinds of cross-referencing it has that would be
useful for improving this situation, and it has 131,019 authors linked
to OpenLibrary who also have links to either VIAF or Wikipedia.

126,810 are linked to OL & VIAF
20,050 are linked to OL & Wikipedia
15,842 are linked to all three

The links show up some differences in world views. For example, Stephen
King (http://www.freebase.com/view/en/stephen_king) is linked to two
Wikipedia articles, one for Stephen King and one for Richard Bachman,
his pseudonym. Many library catalogs will also catalog pseudonyms
separately, but Freebase treats a pseudonym as simply an alias for the
real author (which kind of breaks down for collective pseudonyms or
house names).

If you wanted to, you could go nuts with the cross referencing since,
for example, the Stephen King topic is connected to 20+ language
Wikipedias in addition to English, the Library of Congress NAF, New
York Times, IMDB, MusicBrainz, ISFDB, Netflix, etc.

Many of the authors are linked to multiple OpenLibrary entries
(sometimes up to 12), which is expected given the number of duplicates
in OL, but while doing this analysis I noticed that 125 are linked to
multiple VIAF entries. Some of these probably represent pseudonyms, but
a spot check revealed that some of them are just plain wrong, so
those'll need to be reviewed. Fortunately, it's just a relative
handful.

Tom

p.s. I was going to suggest Google Refine for your data cleanup task,
but then noticed from your blog post that you've already discovered it.

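For the per-key counts asked about above, something along these lines
might be a starting point. It is only a sketch: it assumes each line of
the dump is tab-separated as (type, key, revision, last_modified, JSON),
which should be checked against the actual file, and the function name
count_json_keys and the dump filename are made up for illustration. It
counts a key only when its value is non-empty, so a field like
distributors that is present but always empty would show up as zero.

```python
import gzip
import json
import sys
from collections import Counter


def count_json_keys(lines):
    """Count non-empty top-level JSON keys, grouped by record type.

    Assumes each line is: type, key, revision, last_modified, JSON,
    separated by tabs (an assumption about the dump layout).
    """
    counts = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t", 4)
        if len(parts) != 5:
            continue  # skip lines that do not match the assumed layout
        record_type, raw_json = parts[0], parts[4]
        try:
            record = json.loads(raw_json)
        except ValueError:
            continue  # skip records whose JSON does not parse
        if not isinstance(record, dict):
            continue
        counter = counts.setdefault(record_type, Counter())
        for key, value in record.items():
            # Treat null, empty string, empty list and empty object as "no content".
            if value not in (None, "", [], {}):
                counter[key] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical usage: python count_keys.py ol_dump.txt.gz
    with gzip.open(sys.argv[1], "rt", encoding="utf-8") as dump:
        for record_type, counter in sorted(count_json_keys(dump).items()):
            print(record_type)
            for key, n in counter.most_common():
                print("  %-30s %d" % (key, n))
```

Reading the file as a stream keeps memory use flat, since only the
counters are held rather than the records themselves.
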
>
> Ben
>
> On 8 February 2012 15:41, Ben Companjen <[email protected]> wrote:
>> Hi all,
>>
>> (About types)
>> I was wondering how the contributors[] field is stored within
>> /type/edition. There is no mention at
>> http://openlibrary.org/type/edition, but the JSON representation at
>> e.g. http://openlibrary.org/books/OL25154702M.json shows the field (a
>> compound type with role and name, both plain strings), and the RDF
>> template uses the information from the field too (just the name part
>> though).
>>
>> I found out that the contributions[] field (that only has strings)
>> contains data that come (probably) straight from the MARC records and
>> may or may not have a role included in the string (e.g. "Ben Companjen
>> (Editor)". The field is not displayed in the page view, but it is in
>> the RDF and JSON views.
>>
>> In the complete datadump of January 31st, no edition uses the
>> distributors[] field. I had already asked myself what it is for, but
>> now it seems it isn't used at all.
>>
>> (About documentation)
>> There is an issue on GitHub
>> (https://github.com/internetarchive/openlibrary/issues/100) about
>> documenting the datadumps, which would include these fields. Perhaps
>> it's a good idea to start with the individual properties and types?
>> Infogami allows a description of each type and property:
>> http://infogami.org/quicklook. Documenting the datadumps would then be
>> easy: just copy the applicable property descriptions.
>>
>> I noticed that on the http://openlibrary.org/developers page, the
>> "Bugs" link points to the Launchpad bug tracker (which is no longer
>> tracked). Should it be updated to point to the issues on GitHub, and
>> should the current bugs on Launchpad be moved?
>>
>> And what is the status of http://code.openlibrary.org/ ? The
>> Developers page points to this documentation through "OL Development".
>> Is this documentation still linked to the in-code documentation?
>>
>> I am looking to see if I can help out with more than the RDF output,
>> but I'm having a hard time finding out what some of the code is doing.
>> Thanks in advance for the answers.
>>
>> Regards,
>>
>> Ben
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to [email protected]
