I posted this a while back to the ol-discuss listserv but didn't get a response; ol-tech seems more appropriate. Does anyone know the basics of getting started on this? I'm proficient in coding, but I don't want to spend hours digging through piles of code and framework documentation only to find out this isn't possible, or that there was an easier way.
Karen Coyle on the code4lib listserv pointed me in the direction of the source code for the merge algorithms OL uses to de-dupe records (https://github.com/openlibrary/openlibrary/tree/master/openlibrary/catalog/merge). I'm interested in taking two sets of MARC records and spitting out either a report on similarity or a merged record set. I've looked at the code and the OL instructions, but it isn't clear to me exactly how the merge code fits in and whether it is possible to run it independently of the overall system. Does anyone have any insight to point me in the right direction?

Mike Beccaria
Systems Librarian
Head of Digital Initiative
Paul Smith's College
518.327.6376
[email protected]
Become a friend of Paul Smith's Library on Facebook today!

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Anand Chitipothu
Sent: Monday, September 02, 2013 3:32 AM
To: Open Library -- technical discussion
Subject: Re: [ol-tech] MySQL import

On 02-Sep-2013, at 11:27 AM, Ben Companjen wrote:

> Well, it is not live indexing - I know that slows things down a lot. :)
> It is the first time I used the MySQL Python connector and based my
> script on an example [1].
> I think the bottleneck may be calling commit() after each edition
> record. My hard disk is writing almost continuously. I also use one
> insert statement for each contributor, publisher, identifier, etc. I
> think dynamically creating insert statements containing all of these
> may outweigh the many database calls.

The best approach would be to create a TSV file with all the data to be loaded into MySQL and use "LOAD DATA INFILE 'data.txt' INTO TABLE table_name". It might be even faster to split that file into smaller files of 100K lines each and load them one after another.

If you can show me your script, I can suggest improvements.
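A rough sketch of what that might look like in Python (the table name, columns, and file path here are placeholders, not from Ben's actual script): write all the rows to one tab-separated file, optionally split into 100K-line chunks, and hand each file to MySQL with a single LOAD DATA statement instead of committing after every INSERT.

```python
import csv
import os
import tempfile


def write_tsv(rows, path):
    """Write rows (lists of values) to a tab-separated file that
    MySQL's LOAD DATA INFILE can ingest directly."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)


def chunked(rows, size=100000):
    """Yield successive chunks of rows, so each chunk can become
    its own file per Anand's suggestion."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]


def load_data_sql(path, table):
    """Build the LOAD DATA statement; run it with cursor.execute()
    on an open MySQL connection (the server needs FILE privileges,
    or use LOAD DATA LOCAL INFILE)."""
    return "LOAD DATA INFILE '%s' INTO TABLE %s" % (path, table)


# Example: dump edition rows in one shot instead of per-row INSERT + commit()
rows = [["OL1M", "Some Title"], ["OL2M", "Another Title"]]
path = os.path.join(tempfile.gettempdir(), "editions.tsv")
write_tsv(rows, path)
sql = load_data_sql(path, "editions")
```

This only sketches the file-writing side; the actual column layout has to match Ben's schema, and string fields containing tabs or newlines would need escaping per MySQL's LOAD DATA rules.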
Anand

_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to [email protected]
