Karen probably meant to point you to ol-tech, rather than the general ol-discuss, so bcc'ing the latter and adding the former.
On Wed, Aug 28, 2013 at 3:24 PM, Michael Beccaria <[email protected]> wrote:

> Karen Coyle in the code4lib listserv pointed me in the direction of the
> source code for the merge algorithms OL uses to de-dupe records (
> https://github.com/openlibrary/openlibrary/tree/master/openlibrary/catalog/merge).
> I’m interested in taking 2 sets of marc records and spitting out either a
> report on similarity or a merged record set. I looked at the code and the
> OL instructions but it isn’t clear to me exactly how the merge code fits in
> and whether it is possible to run it independently of the overall system.
>
> Anyone have any insight into this to point me in the right direction?
> [.sig longer than message elided]

From a cursory glance, it looks to me like it's specific to OL's internal JSON format, but you could probably hook up the OL MARC reader to get records into the necessary format. The other thing I noticed is that there appears to be a lot of code specific to using the Amazon data which might not be appropriate in your case.

The basic algorithm is what you'd expect -- compare the significant fields in two records to come up with a weighted score which you then threshold to decide whether they're a match or not, so even if you don't use the code as is, you could extract and reuse the scoring logic pretty easily (assuming you agree with the weights used).

Tom
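For illustration, the general shape of that kind of scoring (compare significant fields, weight them, threshold the total) might look roughly like the Python below. This is only a sketch under stated assumptions: the field names, the weights, the threshold, and the use of difflib for string similarity are all hypothetical and are not taken from the OL merge code.

```python
from difflib import SequenceMatcher

# Hypothetical weights and threshold, for illustration only.
# These are NOT the values used by the Open Library merge code.
WEIGHTS = {"title": 0.5, "author": 0.3, "publish_date": 0.1, "isbn": 0.1}
THRESHOLD = 0.75


def normalize(value):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(str(value).lower().split())


def field_similarity(a, b):
    """Similarity in [0, 1] for two field values; 0.0 if either is missing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_score(rec1, rec2, weights=WEIGHTS):
    """Weighted sum of per-field similarities between two dict records."""
    return sum(
        weight * field_similarity(rec1.get(field), rec2.get(field))
        for field, weight in weights.items()
    )


def is_match(rec1, rec2, threshold=THRESHOLD):
    """Decide match / no match by thresholding the weighted score."""
    return match_score(rec1, rec2) >= threshold
```

Here each record is assumed to be a plain dict already extracted from MARC (or from OL's JSON). Running match_score over every candidate pair would give the similarity report; is_match gives the yes/no decision that a merge step would act on.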
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to [email protected]
