Karen probably meant to point you to ol-tech, rather than the general
ol-discuss, so bcc'ing the latter and adding the former.

On Wed, Aug 28, 2013 at 3:24 PM, Michael Beccaria
<[email protected]> wrote:

>  Karen Coyle in the code4lib listserv pointed me in the direction of the
> source code for the merge algorithms OL uses to de-dupe records (
> https://github.com/openlibrary/openlibrary/tree/master/openlibrary/catalog/merge).
> I’m interested in taking 2 sets of marc records and spitting out either a
> report on similarity or a merged record set. I looked at the code and the
> OL instructions but it isn’t clear to me exactly how the merge code fits in
> and whether it is possible to run it independently of the overall system.
>
> Anyone have any insight into this to point me in the right direction?
>

[.sig longer than message elided]

From a cursory glance, it looks to me like it's specific to OL's internal
JSON format, but you could probably hook up the OL MARC reader to get
records into the necessary format.  The other thing I noticed is that there
appears to be a lot of code specific to using the Amazon data which might
not be appropriate in your case.

The basic algorithm is what you'd expect -- compare the significant fields
in two records to come up with a weighted score which you then threshold to
decide whether they're a match or not, so even if you don't use the code as
is, you could extract and reuse the scoring logic pretty easily (assuming
you agree with the weights used).
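
To make the idea concrete, here is a minimal sketch of that kind of
weighted-score-plus-threshold comparison. It is not the OL code: the field
names, weights, threshold, and the use of difflib for string similarity are
all assumptions picked for illustration, and records are assumed to be plain
dicts.

```python
from difflib import SequenceMatcher

# Hypothetical weights and threshold, for illustration only.
# These are NOT the values used by the Open Library merge code.
WEIGHTS = {"title": 0.5, "author": 0.3, "publish_date": 0.2}
THRESHOLD = 0.8


def field_similarity(a, b):
    """Return a 0..1 similarity for two field values (0.0 if either is missing)."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(rec1, rec2):
    """Weighted sum of per-field similarities between two dict records."""
    return sum(
        weight * field_similarity(rec1.get(field), rec2.get(field))
        for field, weight in WEIGHTS.items()
    )


def is_match(rec1, rec2, threshold=THRESHOLD):
    """Treat two records as duplicates when the score reaches the threshold."""
    return match_score(rec1, rec2) >= threshold
```

Running match_score over every candidate pair from the two record sets would
give you the similarity report; is_match gives the yes/no decision you'd need
before merging. The weights and threshold would have to be tuned against
your own data.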

Tom