Interesting - thank you again! Would also be nice to make a note of the Freebase ID/link in our author description field...
g

Tom Morris wrote:
> On Thu, Apr 29, 2010 at 1:48 PM, George Oates <[email protected]> wrote:
>> Super fantastic, Tom!!! Thanks!
>> How many pages are there?
>
> The web app seems to run into query timeouts around 5 or 6 pages,
> perhaps because of the way I'm sorting things, but the grand total is
> more than you want to be paging through anyway. I count 7138 authors
> after de-duping (18,445 records on the OL side).
>
> Here's the histogram of counts by number of duplicates:
>
> 2 6496
> 3 494
> 4 89
> 5 32
> 6 10
> 7 9
> 8 3
> 9 3
> 11 1
> 12 1
>
> It's trivial to generate a file of these dupes, but I'd also like to
> figure out how this evolves going forward (ie as the Freebase
> community identifies additional merges).
>
> Tom
>>
>> Tom Morris wrote:
>>> Rather than just complain about the data quality, here's a small
>>> contribution to help improve it. I put together a little application
>>> which shows all authors who have multiple Open Library author records,
>>> as identified by the Freebase community.
>>>
>>> You can find it at http://ol-dupes.freebaseapps.com/authors
>>>
>>> The list is sorted from most to least number of duplicates and each
>>> entry is linked to all OL records as well as the Freebase record.
>>> Freebase uses a slightly different schema, so the authors are linked
>>> to Books ("works" in FRBR lingo) and those are linked to Book Editions
>>> which equate to the Open Library book records.
>>>
>>> I also included all the known names for the authors. Most of these
>>> will have come from the merger of multiple records. I haven't looked
>>> in detail, but it wouldn't surprise me if some of the bad names are
>>> from munging on the Freebase side of things. You can see what the
>>> name associated with each OL record is by clicking on the ID link.
>>>
>>> The app is better for browsing than actual data cleanup, but I'd be
>>> happy to show someone how to extract the data in a form that could be
>>> used in the OL processes (or do it for you). The app is BSD licensed
>>> so anyone's free to hack on it as well.
>>>
>>> Tom
>>> _______________________________________________
>>> Ol-discuss mailing list
>>> [email protected]
>>> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
>>> To unsubscribe from this mailing list, send email to
>>> [email protected]
