Hi Tom,

On 20 February 2012 21:01, Tom Morris <[email protected]> wrote:
> On Mon, Feb 20, 2012 at 5:43 AM, Ben Companjen <[email protected]> wrote:
>> I played a little with the datadump of January and found out:
>> - there are 47796615 records (author, work, edition) in the dump
>> - distributors is empty in every edition record
>> - there are 7439623 records that have content in the contributions field
>> - there are 14996 records with content in the contributors field,
>> which is an array of (role, name) tuples; there are just under 300
>> values for 'role' - that could use some cleanup.
>> - there are 38 records containing a link to VIAF
>> - there are 1294 records containing a link to the English Wikipedia
>> - there are 1864 records with content in the wikipedia field, which
>> has links to non-English Wikipedias too.
>>
>> I published a blog post about this at
>> http://ben.companjen.name/2012/02/playing-with-the-open-library/
>
> Thanks for posting that.  I don't suppose you have counts for all the
> JSON keys that appear, do you? :-)

No, I haven't, unfortunately. These experiments were very basic:
extract lines, write them to separate files, count the lines. I may
try to extract more statistics, but I first have to find a more
efficient way of doing so.
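
For what it's worth, counting the keys should be doable in a single
streaming pass over the dump instead of splitting it into intermediate
files first. A minimal sketch in Python, assuming each dump line is
tab-separated with the record JSON in the last column (the filename is
just a placeholder):

import gzip
import json
from collections import Counter

key_counts = Counter()

# Stream the gzipped dump; each line should carry one record.
# Assumption: columns are tab-separated, the JSON blob is the last one.
with gzip.open('ol_dump_2012-01-31.txt.gz', 'rt') as dump:
    for line in dump:
        record = json.loads(line.rstrip('\n').split('\t')[-1])
        key_counts.update(record.keys())

for key, count in key_counts.most_common():
    print(count, key)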

I did start to manually extract a list of elements in author and
edition records that is more complete than
http://openlibrary.org/type/author and
http://openlibrary.org/type/edition. Perhaps it can serve as the basis
for the documentation of the data dumps (issue #100 on GitHub). I think
I will add it to the issue as a note, as I cannot update the
documentation pages myself.
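
If the manual extraction gets tedious, the same kind of streaming pass
could collect the set of fields per record type, which is roughly the
list the documentation needs. A sketch, under the same assumptions
about the dump layout (first column = record type, last column = JSON):

import gzip
import json
from collections import defaultdict

fields_by_type = defaultdict(set)

# Assumption: the first tab-separated column holds the record type
# (e.g. /type/author) and the last column holds the JSON record.
with gzip.open('ol_dump_2012-01-31.txt.gz', 'rt') as dump:
    for line in dump:
        columns = line.rstrip('\n').split('\t')
        fields_by_type[columns[0]].update(json.loads(columns[-1]).keys())

for record_type in ('/type/author', '/type/edition', '/type/work'):
    print(record_type, sorted(fields_by_type.get(record_type, ())))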

>
> I'm surprised both by the rarity of the external links and the fact
> that they're encoded as just raw URLs instead of having some notion
> that they're identifiers.

VIAF URIs can be used as identifiers. Wikipedia URIs are (in the
linked data sense) not identifiers of the people, but identifiers of
pages about the people. It's a pity that DBpedia only creates its
identifiers from the English Wikipedia, because some of the people we
are talking about only have pages in non-English Wikipedias. Of course
Wikipedia URIs can still be used to match OL profiles against other
sources.
I think the "uris" field can be used to store non-OL URI references
for an author, like VIAF and DBpedia URIs. Can anyone confirm or
correct me on that?
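
As a quick sanity check on that guess, one could scan the author
records and report any that already carry VIAF or DBpedia references
in a "uris" field. A sketch - the field name is my assumption, and the
filename is a placeholder for the authors dump:

import gzip
import json

# Assumption: author records may list external references in a "uris"
# field; report the ones pointing at viaf.org or dbpedia.org.
with gzip.open('ol_dump_authors_2012-01-31.txt.gz', 'rt') as dump:
    for line in dump:
        record = json.loads(line.rstrip('\n').split('\t')[-1])
        external = [str(u) for u in record.get('uris', [])
                    if 'viaf.org' in str(u) or 'dbpedia.org' in str(u)]
        if external:
            print(record.get('key'), external)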

>
> Focusing on the authors, I count 6,757,452 author records, so the VIAF
> & Wikipedia links represent a vanishingly small fraction.

There is no explicit incentive for users to add these links, and I'd
like to see person identifiers like VIAF IDs treated as identifiers,
not just as links. Adding them will therefore probably have to be done
programmatically rather than manually. Of course the former is much
faster, but I don't have a bot at hand to do it :)
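
Most of the work for such a bot would be pairing OL keys with VIAF IDs
and expressing the IDs as URIs; the actual write step would go through
the OL API, which I'd still have to read up on. A sketch of the
preparation step, assuming a hypothetical CSV of (OL author key, VIAF
ID) matches:

import csv

# Hypothetical input: one match per row, e.g. "OL2162284A,97003727".
with open('ol_viaf_matches.csv', newline='') as matches:
    for ol_key, viaf_id in csv.reader(matches):
        # Treat the VIAF ID as an identifier and express it as a URI.
        print('/authors/' + ol_key, 'http://viaf.org/viaf/' + viaf_id)
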
>
> I took a look at Freebase to see what kinds of cross-referencing it
> has that would be useful for improving this situation and it has
> 131,019 authors linked to OpenLibrary who also have links to either
> VIAF or Wikipedia.
>
> 126,810 are linked to OL & VIAF
>  20,050 are linked to OL & Wikipedia
>  15,842 are linked to all three
>
> The links show up some differences in world views.  For example,
> Stephen King (http://www.freebase.com/view/en/stephen_king) is linked
> to two Wikipedia articles, one for Stephen King and one for Richard
> Bachman, his pseudonym.  Many library catalogs will also catalog
> pseudonyms separately, but Freebase treats pseudonyms as simply an
> alias for real author (which kind of breaks down for collective
> pseudonyms or house names).

That is promising. I believe the Freebase licence is not public
domain, though - can its data still be used to match OL authors to
VIAF?
>
> If you wanted to, you could go nuts with the cross-referencing since,
> for example, the Stephen King topic is connected to 20+ language
> Wikipedias in addition to English, the Library of Congress NAF, New
> York Times, IMDB, MusicBrainz, ISFDB, Netflix, etc.

Let's stick to VIAF and Wikipedia for now and have all the other
parties do that too... ;)
>
> Many of the authors are linked to multiple OpenLibrary entries
> (sometimes up to 12) which is expected given the number of duplicates
> in OL, but while doing this analysis I noticed that 125 are linked to
> multiple VIAF entries.  Some of these probably represent pseudonyms,
> but a spot check revealed that some of them are just plain wrong, so
> those'll need to be reviewed.  Fortunately, it's just a, relative,
> handful.

There are errors in VIAF too - I think I already mentioned seeing four
identities for the Museum of Modern Art in New York. Perhaps you can
publish your list of entries with errors somewhere so that they can be
checked and, if needed, fixed?
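
If that list comes out as something like a CSV of (Freebase topic, OL
key, VIAF ID) triples, flagging the suspicious cases for review is a
single pass. A sketch over that hypothetical file format:

import csv
from collections import defaultdict

viaf_by_topic = defaultdict(set)

# Hypothetical input: one row per link,
# e.g. "/en/stephen_king,OL2162284A,97003727".
with open('freebase_ol_viaf.csv', newline='') as links:
    for topic, ol_key, viaf_id in csv.reader(links):
        viaf_by_topic[topic].add(viaf_id)

# Topics linked to more than one VIAF identity are the ones to review.
for topic, viaf_ids in sorted(viaf_by_topic.items()):
    if len(viaf_ids) > 1:
        print(topic, sorted(viaf_ids))
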
>
> Tom
>
> p.s. I was going to suggest Google Refine for your data cleanup task,
> but then noticed from your blog post that you've already discovered
> it.

I like Google Refine, but it didn't take my 8 GB file of records with
"contributions" ;)

Ben
