I posted this a while back to the ol-discuss listserv but didn't get a response; ol-tech seems more appropriate. Does anyone know the basics of getting started on this? I'm proficient in coding, but I don't want to spend hours digging through piles of code and framework documentation only to find out this isn't possible, or that there was an easier way.
Karen Coyle on the code4lib listserv pointed me in the direction of the source code for the merge algorithms OL uses to de-dupe records (https://github.com/openlibrary/openlibrary/tree/master/openlibrary/catalog/merge). I'm interested in taking two sets of MARC records and spitting out either a report on similarity or a merged record set. I've looked at the code and the OL instructions, but it isn't clear to me exactly how the merge code fits in and whether it is possible to run it independently of the overall system. Does anyone have any insight to point me in the right direction?

Mike Beccaria
Systems Librarian
Head of Digital Initiative
Paul Smith's College
518.327.6376
[email protected]
Become a friend of Paul Smith's Library on Facebook today!

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Anand Chitipothu
Sent: Monday, September 02, 2013 3:32 AM
To: Open Library -- technical discussion
Subject: Re: [ol-tech] MySQL import

On 02-Sep-2013, at 11:27 AM, Ben Companjen wrote:

> Well, it is not live indexing - I know that slows things down a lot. :)
> It is the first time I used the MySQL Python connector and based my
> script on an example [1].
> I think the bottleneck may be calling commit() after each edition
> record. My hard disk is writing almost continuously. I also use one
> insert statement for each contributor, publisher, identifier, etc. I
> think dynamically creating insert statements containing all of these
> may outweigh the many database calls.

The best approach would be to create a TSV file with all the data to be loaded into MySQL and use "LOAD DATA INFILE 'data.txt' INTO TABLE table_name". It might be even faster to split that file into smaller files of 100K lines each and load them one after another.

If you can show me your script, I can suggest improvements.
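A rough sketch of what that might look like in Python (the table name, columns, and file path here are placeholders, not from Ben's actual script): write all the rows to one tab-separated file, optionally split into 100K-line chunks, and hand each file to MySQL with a single LOAD DATA statement instead of committing after every INSERT.

```python
import csv
import os
import tempfile


def write_tsv(rows, path):
    """Write rows (lists of values) to a tab-separated file that
    MySQL's LOAD DATA INFILE can ingest directly."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)


def chunked(rows, size=100000):
    """Yield successive chunks of rows, so each chunk can become
    its own file per Anand's suggestion."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]


def load_data_sql(path, table):
    """Build the LOAD DATA statement; run it with cursor.execute()
    on an open MySQL connection (the server needs FILE privileges,
    or use LOAD DATA LOCAL INFILE)."""
    return "LOAD DATA INFILE '%s' INTO TABLE %s" % (path, table)


# Example: dump edition rows in one shot instead of per-row INSERT + commit()
rows = [["OL1M", "Some Title"], ["OL2M", "Another Title"]]
path = os.path.join(tempfile.gettempdir(), "editions.tsv")
write_tsv(rows, path)
sql = load_data_sql(path, "editions")
```

This only sketches the file-writing side; the actual column layout has to match Ben's schema, and string fields containing tabs or newlines would need escaping per MySQL's LOAD DATA rules.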
Anand

_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to [email protected]
