On Thu, Apr 29, 2010 at 1:48 PM, George Oates <[email protected]> wrote:
> Super fantastic, Tom!!! Thanks!
> How many pages are there?

The web app seems to run into query timeouts around page 5 or 6,
perhaps because of the way I'm sorting things, but the grand total is
more than you want to be paging through anyway.  I count 7,138 authors
after de-duping (18,445 records on the OL side).

Here's the histogram of counts by number of duplicates:

dupes  authors
    2     6496
    3      494
    4       89
    5       32
    6       10
    7        9
    8        3
    9        3
   11        1
   12        1
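
For anyone who wants to build a histogram like this from their own
extract, here is a minimal Python sketch. It assumes the data has already
been pulled into a mapping from Freebase author id to a list of Open
Library author keys; the ids and keys below are made up for illustration,
and this is not the code the app itself uses.

```python
from collections import Counter

# Hypothetical sample: Freebase author id -> Open Library author keys.
# These ids are invented for illustration; they are not real records.
ol_keys_by_author = {
    "/m/author_a": ["/a/OL1A", "/a/OL2A"],
    "/m/author_b": ["/a/OL3A", "/a/OL4A", "/a/OL5A"],
    "/m/author_c": ["/a/OL6A"],  # only one record, so not a duplicate
    "/m/author_d": ["/a/OL7A", "/a/OL8A"],
}


def duplicate_histogram(keys_by_author):
    """Map number of OL records -> number of authors having that many.

    Authors with a single OL record are skipped, since they are not dupes.
    """
    sizes = (len(set(keys)) for keys in keys_by_author.values())
    return Counter(size for size in sizes if size > 1)


hist = duplicate_histogram(ol_keys_by_author)
for n_records in sorted(hist):
    print(f"{n_records:>5}  {hist[n_records]:>7}")

# Sanity check on the counts posted above: the author column should add
# up to the de-duped author total quoted in the message.
posted = {2: 6496, 3: 494, 4: 89, 5: 32, 6: 10, 7: 9, 8: 3, 9: 3, 11: 1, 12: 1}
total_authors = sum(posted.values())
print(total_authors)  # 7138
```

Writing the keys of each multi-record author out one line per author
would give the "file of dupes" in the same pass.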

It's trivial to generate a file of these dupes, but I'd also like to
figure out how this evolves going forward (i.e., as the Freebase
community identifies additional merges).

Tom
>
>
> Tom Morris wrote:
>> Rather than just complain about the data quality, here's a small
>> contribution to help improve it.  I put together a little application
>> which shows all authors who have multiple Open Library author records,
>> as identified by the Freebase community.
>>
>> You can find it at http://ol-dupes.freebaseapps.com/authors
>>
>> The list is sorted from most to fewest duplicates, and each
>> entry is linked to all OL records as well as the Freebase record.
>> Freebase uses a slightly different schema, so the authors are linked
>> to Books ("works" in FRBR lingo) and those are linked to Book Editions
>> which equate to the Open Library book records.
>>
>> I also included all the known names for the authors.  Most of these
>> will have come from the merger of multiple records.  I haven't looked
>> in detail, but it wouldn't surprise me if some of the bad names are
>> from munging on the Freebase side of things.  You can see what the
>> name associated with each OL record is by clicking on the ID link.
>>
>> The app is better for browsing than actual data cleanup, but I'd be
>> happy to show someone how to extract the data in a form that could be
>> used in the OL processes (or do it for you).  The app is BSD licensed
>> so anyone's free to hack on it as well.
>>
>> Tom
>> _______________________________________________
>> Ol-discuss mailing list
>> [email protected]
>> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
>> To unsubscribe from this mailing list, send email to 
>> [email protected]