Hi Tom,

On 20 February 2012 21:01, Tom Morris <[email protected]> wrote:
> On Mon, Feb 20, 2012 at 5:43 AM, Ben Companjen <[email protected]> wrote:
>> I played a little with the datadump of January and found out:
>> - there are 47796615 records (author, work, edition) in the dump
>> - distributors is empty in every edition record
>> - there are 7439623 records that have content in the contributions field
>> - there are 14996 records with content in the contributors field,
>>   which is an array of (role, name) tuples; there are just under 300
>>   values for 'role' - that could use some cleanup.
>> - there are 38 records containing a link to VIAF
>> - there are 1294 records containing a link to the English Wikipedia
>> - there are 1864 records with content in the wikipedia field, which
>>   has links to non-English Wikipedias too.
>>
>> I published a blog post about this at
>> http://ben.companjen.name/2012/02/playing-with-the-open-library/
>
> Thanks for posting that. I don't suppose you have counts for all the
> JSON keys that appear, do you? :-)
No, I haven't, unfortunately. These experiments were very basic: extract
the lines, write them to separate files, count the lines. I may try to
extract more statistics, but first I have to find a more efficient way of
doing so. I did start manually compiling a list of elements in author and
edition records that is more complete than
http://openlibrary.org/type/author and http://openlibrary.org/type/edition.
Perhaps it can serve as the basis for the documentation of the datadumps
(issue #100 on GitHub). I think I will add it to the issue as a note, as I
cannot update the documentation pages myself.

> I'm surprised both by the rarity of the external links and the fact
> that they're encoded as just raw URLs instead of having some notion
> that they're identifiers.

VIAF URIs can be used as identifiers. Wikipedia URIs are (in the linked
data sense) not identifiers of the people, but identifiers of pages about
the people. It's a pity that DBpedia only uses the English Wikipedia to
create identifiers for the people we are talking about, because some
people are covered only in non-English Wikipedias. Of course Wikipedia
URIs can still be used to match OL profiles against other profiles. I
think the "uris" field can be used to store non-OL URI references for an
author, such as VIAF and DBpedia URIs. Can anyone confirm or correct me
on that?

> Focusing on the authors, I count 6,757,452 author records, so the VIAF
> & Wikipedia links represent a vanishingly small fraction.

There is no explicit incentive for users to add these links, and I'd like
to see person identifiers like VIAF IDs treated as identifiers in their
own right, not just as links. Adding them will therefore have to be done
programmatically rather than manually.
Of course the former is much faster, but I don't have a bot at hand to do
it :)

> I took a look at Freebase to see what kinds of cross-referencing it
> has that would be useful for improving this situation and it has
> 131,019 authors linked to OpenLibrary who also have links to either
> VIAF or Wikipedia.
>
> 126,810 are linked to OL & VIAF
> 20,050 are linked to OL & Wikipedia
> 15,842 are linked to all three
>
> The links show up some differences in world views. For example,
> Stephen King (http://www.freebase.com/view/en/stephen_king) is linked
> to two Wikipedia articles, one for Stephen King and one for Richard
> Bachman, his pseudonym. Many library catalogs will also catalog
> pseudonyms separately, but Freebase treats pseudonyms as simply an
> alias for the real author (which kind of breaks down for collective
> pseudonyms or house names).

That is promising. I believe the Freebase licence is not PD, though - so
can it be used to match OL authors to VIAF?

> If you wanted to, you could go nuts with the cross-referencing since,
> for example, the Stephen King topic is connected to 20+ language
> Wikipedias in addition to English, the Library of Congress NAF, New
> York Times, IMDB, MusicBrainz, ISFDB, Netflix, etc.

Let's stick to VIAF and Wikipedia for now and have all the other parties
do that too... ;)

> Many of the authors are linked to multiple OpenLibrary entries
> (sometimes up to 12) which is expected given the number of duplicates
> in OL, but while doing this analysis I noticed that 125 are linked to
> multiple VIAF entries. Some of these probably represent pseudonyms,
> but a spot check revealed that some of them are just plain wrong, so
> those'll need to be reviewed. Fortunately, it's just a relative
> handful.

There are errors in VIAF too - I think I already mentioned seeing four
identities for the Museum of Modern Art in New York. Perhaps you can
publish your list of entries with errors somewhere, so that they can be
checked and fixed where needed?
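By the way, about the "more efficient way" of extracting statistics I
mentioned above: counting every top-level JSON key in the dump can be done
in a single streaming pass, so memory use stays flat even for a
multi-gigabyte file. A rough Python sketch - note that the tab-separated
layout with the record JSON in the last column is my assumption about the
dump format, so adjust the split if your dump differs:

```python
import json
from collections import Counter

def count_keys(lines):
    """Count top-level JSON keys across dump records.

    Assumes each line is tab-separated with the record JSON in the
    last column (an assumption about the dump layout); lines that are
    pure JSON also work, since split("\t")[-1] then yields the whole line.
    """
    counts = Counter()
    for line in lines:
        record = json.loads(line.rstrip("\n").split("\t")[-1])
        counts.update(record.keys())
    return counts

# Usage sketch: stream the (possibly gzipped) dump line by line.
# import gzip
# with gzip.open("ol_dump_latest.txt.gz", "rt") as f:
#     for key, n in count_keys(f).most_common(20):
#         print(key, n)
```

That would answer your JSON-key question in one pass over the file,
without loading anything into memory at once.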
> Tom
>
> p.s. I was going to suggest Google Refine for your data cleanup task,
> but then noticed from your blog post that you've already discovered
> it.

I like Google Refine, but it didn't take my 8 GB file of records with
"contributions" ;)

Ben

>> Ben

_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to
[email protected]
