Re: [ol-discuss] Disambiguating authors

Richard Light Thu, 22 Jul 2010 00:59:47 -0700

In message 
<[email protected]>, Tom 
Morris <[email protected]> writes
>On Wed, Jul 21, 2010 at 1:56 PM, Edward Betts <[email protected]> wrote:
>
>> Library MARC records use birth and death dates to disambiguate authors
>> with the same name. The problem is that some MARC records aren't that
>> great, they contain mistakes, or are missing the dates. We also load
>> data from non-MARC sources. We use some heuristics to try and guess if
>> the author represents the same person or not. We're always trying to
>> improve these heuristics. For example we should be looking at the type
>> of subjects that an author writes about and see if the new book we're
>> loading matches the profile of an existing author with that name.
>
>You should never assume that authors are the same based on name alone.


Agreed. A person's name is just a property of that person: it isn't an 
identity. First, it is clearly not unique; second, it is not immutable 
(women's married names; life peers; Charles Dodgson, who also published 
as Lewis Carroll). To have a reasonable guarantee of a person's 
identity you need to match on a number of properties.

The larger the pool of people you are dealing with, the more properties 
- and the more precision - you need. Thus in a typical workplace you can 
often get away with just using first names as identifiers (with a 
surname initial as a disambiguator where you have two or more "Richard"s 
etc.). For authors, the conventional wisdom is that name and dates are a 
sufficient set of properties, but again there is the question of 
precision: do you include all names, and do the dates go down to the 
year, or the day, of birth and death?
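
To make the precision question concrete, here is a minimal sketch in 
Python. Everything in it (PartialDate, Candidate, dates_agree, 
probably_same_person) is a name invented for illustration, not anything 
Open Library actually uses, and the matching rule is only one possible 
conservative policy: names must match, no date may conflict, and at 
least one date must positively agree.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PartialDate:
    """A date known to year, month, or day precision (hypothetical type)."""
    year: int
    month: Optional[int] = None
    day: Optional[int] = None


def dates_agree(a: Optional[PartialDate], b: Optional[PartialDate]) -> Optional[bool]:
    """Compare two dates only as far as both are known.

    Returns None when either date is missing, so that "unknown" is
    never mistaken for "same".
    """
    if a is None or b is None:
        return None
    for x, y in ((a.year, b.year), (a.month, b.month), (a.day, b.day)):
        if x is None or y is None:
            break  # no more shared precision to compare
        if x != y:
            return False
    return True


@dataclass(frozen=True)
class Candidate:
    name: str
    birth: Optional[PartialDate] = None
    death: Optional[PartialDate] = None


def probably_same_person(a: Candidate, b: Candidate) -> bool:
    """Assumed policy: same name, no conflicting date, one agreeing date."""
    if a.name.casefold() != b.name.casefold():
        return False
    verdicts = (dates_agree(a.birth, b.birth), dates_agree(a.death, b.death))
    if False in verdicts:
        return False  # any conflict rules the match out
    return True in verdicts  # a name alone is never enough


statesman = Candidate("Winston Churchill", PartialDate(1874, 11, 30), PartialDate(1965, 1, 24))
novelist = Candidate("Winston Churchill", PartialDate(1871), PartialDate(1947))
year_only = Candidate("Winston Churchill", PartialDate(1874))
no_dates = Candidate("Winston Churchill")

print(probably_same_person(statesman, novelist))   # False: the dates conflict
print(probably_same_person(statesman, year_only))  # True: agrees to year precision
print(probably_same_person(statesman, no_dates))   # False: nothing beyond the name
```

Note that the year-only record matches. Whether that is acceptable is 
exactly the precision question: in a large enough pool, two people with 
the same name and birth year will turn up sooner or later.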

>It's a lot easier to merge duplicates than it is to tease apart bad
>merges, so it is, in my opinion, much better to be very conservative
>in any automatic matching process.

How far you go with this is a matter of choice, but I would certainly 
suggest that you don't limit yourself to MARC practices if you want to 
offer a useful Linked Data resource.

The first question to address is whether you want to create an author 
authority file, or a person authority file which happens to contain lots 
of authors. If it's just an author file, presumably you then require a 
separate file for people who are the subjects of works?  So Winston 
Churchill, for example, might then end up with two identifiers, because 
he was both an author and the subject of books, e.g.:

http://openlibrary.org/authors/OL123456A
http://openlibrary.org/subject/OL456789B

I would argue for designing a person authority framework which allows 
you to record a number of properties (where they are known) about any 
person, and assigns a unique, persistent identifier only when you have 
enough properties to be "sure" of the person's unique identity. As 
properties, I would certainly include name (repeatable, with some sort 
of type qualifier), date _and_place_ of birth and death.  Then you can 
choose whether to use this framework for a single person authority or 
for separate author and "people as subject" authorities.
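
A rough sketch of such a framework in Python follows. The class names, 
the "P000001" identifier format, and above all the threshold of three 
known properties are assumptions made purely for illustration; how many 
properties count as "sure" is a policy decision this sketch does not 
settle.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional


@dataclass
class Name:
    value: str
    kind: str  # type qualifier, e.g. "birth", "married", "pseudonym", "title"


@dataclass
class PersonRecord:
    names: List[Name] = field(default_factory=list)  # repeatable
    birth_date: Optional[str] = None
    birth_place: Optional[str] = None
    death_date: Optional[str] = None
    death_place: Optional[str] = None
    identifier: Optional[str] = None  # stays None until identity is "sure"

    def known_properties(self) -> int:
        """Count names (as one property) plus each known date or place."""
        facts = [self.birth_date, self.birth_place, self.death_date, self.death_place]
        return (1 if self.names else 0) + sum(f is not None for f in facts)


MIN_PROPERTIES = 3   # assumed threshold, not a recommendation
_next_number = count(1)


def assign_identifier(record: PersonRecord) -> Optional[str]:
    """Mint an identifier once, and only when enough properties are known."""
    if record.identifier is None and record.known_properties() >= MIN_PROPERTIES:
        record.identifier = f"P{next(_next_number):06d}"
    return record.identifier


dodgson = PersonRecord(
    names=[Name("Charles Lutwidge Dodgson", "birth"), Name("Lewis Carroll", "pseudonym")],
    birth_date="1832-01-27",
    birth_place="Daresbury",
)
name_only = PersonRecord(names=[Name("Richard Light", "birth")])

print(assign_identifier(dodgson) is not None)  # True: three properties known
print(assign_identifier(name_only))            # None: a name alone is not enough
```

Because the identifier is minted once and then kept, later additions to 
the record (a death place, a married name) do not change it, which is 
what makes it usable as a persistent identifier.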

Remember that the author statement can be "this book was written by a 
person whose name was Richard Light", which is true even when you aren't 
sure of the identity of "Richard Light".
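
In data terms, that means the attribution can carry the name as given 
on the book and leave the link to a person empty until the identity is 
established. A small Python sketch, in which Attribution, resolve, the 
book title, and the "P000042" identifier are all hypothetical:

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Attribution:
    work: str
    name_as_given: str                # always true to the source
    person_id: Optional[str] = None   # None means identity not yet established

    def statement(self) -> str:
        if self.person_id is None:
            return f'"{self.work}" was written by a person whose name was {self.name_as_given}'
        return f'"{self.work}" was written by person {self.person_id} (named as {self.name_as_given})'


def resolve(attribution: Attribution, person_id: str) -> Attribution:
    """Link the attribution to a person later, without altering the recorded name."""
    return replace(attribution, person_id=person_id)


unresolved = Attribution(work="An Example Book", name_as_given="Richard Light")
print(unresolved.statement())
print(resolve(unresolved, "P000042").statement())
```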

Richard
-- 
Richard Light