I think there needs to be some human review involved in this unless these are super high confidence merges.
This is probably more appropriate to ol-tech, so bcc'ing ol-discuss

On Thu, Aug 29, 2013 at 6:44 AM, Richard Light <[email protected]> wrote:

>
> In a general spirit of exploration, I took the OL author dump, extracted
> authors with dates, converted them to XML and fed them into a Modes [1]
> database. I have spent some time tidying up said dates so that they are,
> as far as possible, meaningful and indexable. I have limited my attention
> to authors with a death date and/or a birth date of 1950 or earlier.
>
> One potential use of this work, I thought, might be to find duplicate OL
> author records which represent the same person. I have discovered the
> de-duplication magic wand, and have done a few by hand. However, I am
> rather puzzled. For example, the last person I looked at was A. Hamon
> (1860-1939). In my Modes data I have two records for him, both with dates:
>
> http://openlibrary.org/authors/OL5218117A
> and
> http://openlibrary.org/authors/OL5358432A
>
> Both of these URLs dereference to an actual page, with associated works.
> However, in the de-duplication listing only the first of these identifiers
> is present (though I did find another A. Hamon entry to merge). So, two
> questions:
>
> 1. Is there a format in which I can express a set of instructions to merge
> authors programmatically, to avoid having to do this by hand? The
> excitement of doing this manually has already worn off, but Modes could
> easily tell me where authors have the same name and same DoB/DoD and help
> me to generate a list of identifiers to merge.
>

You can look at the URLs produced by my app
http://ol-dupes.freebaseapps.com/authors (which needs to be updated with
more current data) or just look at the URLs in your browser address bar
when you're in the final stage of a dedupe.

> 2. Why don't all the potential mergees appear in the merge listing,
> despite the fact that loads of clearly irrelevant entries do appear there?
>

Which dedupe listing?
Are you starting from search? This search:

http://openlibrary.org/search/authors?q=a.+hamon

produces three candidates for me and they all show up in the merge dialog.
The merge URL (which is the same one you could generate programmatically) is

http://openlibrary.org/authors/merge?key=OL5218117A&key=OL3466239A&key=OL5358432A

The proposed merge target goes first, followed by all the other candidates.

Tom
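P.S. For generating those merge URLs from a list of identifiers, something like the rough Python sketch below might do. The only thing taken from the URL above is the pattern (repeated key parameters, proposed target first). The function names, the (key, name, birth, death) record layout, and the rule of treating the first key in each group as the target are all my own assumptions for illustration, not anything Open Library or Modes defines. It only builds URLs to open in a browser for review; it does not perform any merge.

```python
from collections import defaultdict
from urllib.parse import urlencode

# Base of the merge URL as seen in the browser address bar.
MERGE_BASE = "http://openlibrary.org/authors/merge"


def build_merge_url(target, others):
    """Build a merge URL with the proposed target first, then the other keys."""
    keys = [target]
    for key in others:
        if key not in keys:  # skip accidental repeats
            keys.append(key)
    return MERGE_BASE + "?" + urlencode([("key", k) for k in keys])


def candidate_merge_urls(records):
    """Group hypothetical (key, name, birth, death) records and emit one URL per group.

    Authors are grouped on identical name and dates. The first key seen in each
    group is used as the merge target, which is an arbitrary choice that a human
    reviewer may well want to change.
    """
    groups = defaultdict(list)
    for key, name, birth, death in records:
        groups[(name.strip().lower(), birth, death)].append(key)
    return [
        build_merge_url(keys[0], keys[1:])
        for keys in groups.values()
        if len(keys) > 1
    ]


if __name__ == "__main__":
    # The third record uses a made-up placeholder key.
    sample = [
        ("OL5218117A", "A. Hamon", "1860", "1939"),
        ("OL5358432A", "A. Hamon", "1860", "1939"),
        ("OL0000000A", "Someone Else", "1900", "1950"),
    ]
    for url in candidate_merge_urls(sample):
        print(url)
```

Matching on exact name and dates will miss variant spellings and can still pair up different people, so each generated URL is best treated as a candidate to check in the merge dialog rather than as a confirmed duplicate.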
