On Mon, Apr 26, 2010 at 8:04 PM, Karen Coyle <[email protected]> wrote:
> Quoting Lars Aronsson <[email protected]>:
>
>> We don't need multiple projects with exchange of
>> data and a never-ending circulation of errors.
>> We need one centralized project, with a focus on
>> quality improvement.
>
> Actually, I disagree about a centralized project. I think those days
> are past. We should now be able to interlink projects, which will
> allow more freedom and innovation, and will let different folks try
> out different approaches. By sharing data we save time and can help
> each other with quality issues. It would definitely be good to have a
> place where all of us working with bibliographic data can hash out
> issues, but I don't think that has developed yet.

I agree that a centralized project is a bad idea.  I'm curious about
the use of the passive voice when talking about a forum for the
coordination of bibliographic data interchange though.  Shouldn't this
be something that the community leaders are actively working on?  Does
Open Library consider itself one of those leaders?  Does Freebase?
How about LibraryThing?  There aren't that many people doing large
scale projects in this space.

>> On www.openlibrary.org the first thing I see is the
>> number of 24 million "books". You got to stop counting
>> all these duplicate records. You must start to focus on
>> quality instead of quantity. There aren't 24 million books.
>> Maybe half of these are duplicate records. Have you
>> got any idea how much junk you are carrying around?
>>
>> On the new "upstream.openlibrary.org" complete beginners
>> are encouraged to add books, as if adding more books
>> was needed. No, it's not. Removing duplicate records
>> is what's needed. Adding birth years and other
>> information to author records is also needed. Things
>> that add quality, not quantity. What percentage of
>> author records have anything more than the name?
>> How do we increase that?
>
> The author names come primarily from library catalogs, and there's a
> practice in libraries that makes sense to librarians but to no one
> else, AFAIK. The birth and death dates in library catalogs are used
> only when they are necessary to distinguish between two authors with
> the same name. So for every "Smith, John, 1906-" there is a "Smith,
> John" who was the first one entered into the catalog (and therefore no
> distinguishing date was needed). (However, I can find exceptions to
> this, as well, so it is very confusing.) I presume that library users
> haven't understood this (and why should they? it's not very logical
> from a user point of view), and probably figure that some names are
> without dates because the librarians didn't know them. This is just
> one of the things that divides libraries from their users.

There's a bigger issue than that from a quality point of view.  If you
look at the author record for "John Smith", or any other pairing of
common first name and common last name with no middle initials, middle
names, or birth/death dates, you'll find that it's a conflation of
every author with that name.  These records are almost entirely
worthless.  Someone needs to go back to the original sources and
create N "John Smith" records for the N institutions that defined them
and then look at how those individual records combine.
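
As a rough illustration of that first step, here is a minimal sketch in
Python.  Everything in it is an assumption made for the example: the
flat (source, heading, title) layout, the source names, and the titles
are hypothetical and are not Open Library's actual schema or data.  It
only shows the idea of keying author records by the contributing
institution as well as the heading, so that each source's "Smith, John"
stays separate until someone decides which ones really match.

```python
from collections import defaultdict

# Hypothetical edition records: (source institution, author heading, title).
# None of this is real catalog data; it only illustrates the shape of the problem.
EDITIONS = [
    ("library_a", "Smith, John", "A History of Kent"),
    ("library_a", "Smith, John, 1906-", "Bridge Design"),
    ("library_b", "Smith, John", "Collected Poems"),
    ("library_b", "Smith, John", "More Poems"),
]


def split_by_source(editions):
    """Group titles under (source, heading) rather than under the bare heading."""
    records = defaultdict(list)
    for source, heading, title in editions:
        records[(source, heading)].append(title)
    return dict(records)


per_source = split_by_source(EDITIONS)

# One line per candidate author record, ready for manual or algorithmic review.
for (source, heading), titles in sorted(per_source.items()):
    print(source, "|", heading, "|", titles)
```

With this sample input, the single conflated "Smith, John" becomes two
candidate records, one per source, alongside the dated heading.
Deciding which candidates describe the same person is the harder second
step, and the sketch does not attempt it.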

> Once the new version of OL is available, the next step is to make it
> possible to merge author names, works, and editions. What merging has
> been done already is based on algorithms, and it appears that some
> data loads didn't get merged properly. Solving the quality issues is
> very much on the task list.

I'm not sure I agree that "pretty" is more important than "accurate,"
but that's obviously a prioritization that happened some time ago (and
the new UI *is* very pretty).

What's the action plan for working through the quality issues?  Is
there a timeframe associated with it?  Have you reviewed all the
merges/changes/splits that have been done at Freebase to get a feel
for where the most serious problems are?  Have you looked at their
merge/split/attribution mechanisms to inform the design of your own?
LibraryThing has a big data collection - what could they do to help?

Perhaps all this stuff is crystal clear deep in the bowels of the
Internet Archive, but externally it looks pretty much like nothing is
happening.  I'm not saying that Lars' rant is justified, but I can
certainly understand his frustration at the lack of (visible)
improvement.

Tom
_______________________________________________
Ol-discuss mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
To unsubscribe from this mailing list, send email to 
[email protected]
