On 04/24/2014 06:12 PM, Kyle Tomita wrote:
> Hi Everyone,
>
> I am tasked with importing a large set of bibliographic MARC records
> (under 1 million).
>
> I have leveraged work from Jason Stephenson,
> http://git.mvlcstaff.org/?p=jason/backstage.git;a=summary.
>
> The import script has been modified so that, instead of doing the
> update directly, it creates SQL files with the update commands. These
> files have about 10,000 records per file. This bypasses checking with
> the database and just creates the update scripts from the MARC
> records.
That's a common approach, but not one that I recommend. When I do
batches, I either use fork() in Perl or Java threads, together with the
Perl DBI module or Java's JDBC. Using batched SQL statements with
threads in Java has given me the best load performance, even faster
than doing COPY statements with files on the server.

I typically have the data files and software on a server other than the
database server; that is, I do the load over the network rather than on
the database server itself. If the OpenSRF services are stopped, or the
load happens in the middle of the night when the system is not busy,
I'll have the software run a number of simultaneous batches equal to
the number of CPU cores on the database server minus one. If I'm doing
this during the day, particularly when libraries are open, I'll either
not batch the updates or run no more than half the number of cores.
Massive bib loads or updates are best done when the system is not busy.

> These files are then batch processed.
>
> This process ignores overlay profiles, which was deemed not needed
> for this process.

Depending on what's in the incoming records, you may be able to match
on 901$c or whatever. I add this mainly for the benefit of others
reading the thread. The Backstage program you linked above will match
on 901$c, since the files Backstage sends are cleaned-up versions of
records that originated in our database.

I have done loads with matches on ISBNs and the like, but I never used
overlay profiles for that. I wrote my own matching code using the
metabib identifier field tables and others. Because of the way many
records get cataloged, such matching is inexact; even overlay profiles
will sometimes match more than one record. When matching records, what
I've done is not import the incoming record when it matched an
existing record.
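The fork()/threaded batching strategy described above can be sketched roughly as follows. This is a Python illustration (Jason uses Perl fork() or Java threads); `run_batch` is a hypothetical stand-in for the real DBI/JDBC work, so the sketch only demonstrates the chunking and the worker-count rule (cores minus one off-hours, at most half the cores during the day):

```python
from concurrent.futures import ThreadPoolExecutor
import os

def chunk(records, size):
    """Yield successive batches of `size` records."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

def run_batch(batch):
    # A real loader would open its own DB connection here and execute
    # batched INSERT/UPDATE statements (DBI in Perl, JDBC in Java).
    # This stub just counts the records it was handed.
    return len(batch)

def load(records, batch_size=10_000, system_busy=False):
    cores = os.cpu_count() or 2
    # Off-hours: cores - 1 workers; daytime: at most half the cores.
    workers = max(1, cores // 2) if system_busy else max(1, cores - 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(run_batch, chunk(records, batch_size)))

total = load(list(range(25_000)))  # three batches: 10k, 10k, 5k
```

Each batch gets its own worker (and, in a real loader, its own database connection), which is what lets the load saturate the database server's cores.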
That was for the case of importing records from a new member library
joining our consortium, so if you have updates for your database,
you'll want to handle that differently.

> Before the update, triggers on biblio.record_entry are turned off,
> particularly the reingest. We run a full reingest after all the
> records have been updated.

I did that for our migration. I don't do that with most of the updates
from Backstage that I run today; they are typically only a few
thousand records at a time these days.

There are some other flags you can adjust that will further improve
the load speed. For instance, I set enabled to true on the following
during our migration in 2011:

ingest.metarecord_mapping.skip_on_insert
ingest.disable_authority_linking
ingest.assume_inserts_only

Running the query below in your Evergreen database will reveal all of
the ingest-related settings that you might want to turn on or off,
depending on your situation:

select name from config.internal_flag where name like 'ingest.%';

If you do mess with any of those flags, you'll need to be sure to do
the appropriate steps after your load finishes. If you forget, search
will not give the results you want. You will also want to remember to
set the flags back to their original values when you're done.

> My reason for posting this is to get feedback from others who are
> charged with updating a large set of bib records (over 500,000)
> about the way in which they succeeded and also pitfalls.

It is a fairly common question. Unfortunately, there isn't really a
one-size-fits-all solution. The changes to marc_stream_importer that
allow it to load records from files, instead of just over a network
port, are a step in that direction. I still typically end up writing
something unique for each batch of records that I have to deal with,
and I find that I very often have to do special scrubbing of each set
of records.
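For reference, flipping the ingest flags mentioned earlier amounts to simple updates on config.internal_flag. This is a sketch, assuming the flags start out disabled (the stock default); verify the names against the query above for your Evergreen version, and remember the follow-up reingest steps:

```sql
-- Before the load: enable the ingest shortcuts.
UPDATE config.internal_flag SET enabled = TRUE
 WHERE name IN ('ingest.metarecord_mapping.skip_on_insert',
                'ingest.disable_authority_linking',
                'ingest.assume_inserts_only');

-- After the load (and the appropriate reingest work): restore them.
UPDATE config.internal_flag SET enabled = FALSE
 WHERE name IN ('ingest.metarecord_mapping.skip_on_insert',
                'ingest.disable_authority_linking',
                'ingest.assume_inserts_only');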
Records from different sources often have their own quirks that don't
always match up with what we do at MVLC. Depending on the time that I
have to devote to the project and the number of incoming records, I
usually do not split the load into batches; I just run the whole thing
through in sequence. I will typically do this so that each record is
its own database transaction, with appropriate error handlers, so that
one failed record doesn't stop the whole batch. The timeliness of
getting the records loaded (i.e., "by next Wednesday" or some such)
typically overrides the raw speed of loading them. It is rare that I
need to worry about loading X number of records per minute.

> Kyle Tomita
>
> Developer II, Catalyst IT Services
>
> Beaverton Office
>
> Sent from my Verizon Wireless 4G LTE smartphone

-- 
Jason Stephenson
Assistant Director for Technology Services
Merrimack Valley Library Consortium
1600 Osgood ST, Suite 2094
North Andover, MA 01845
Phone: 978-557-5891
Email: [email protected]
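[Editor's note: the one-transaction-per-record pattern described in the message above can be sketched like this. Python with an in-memory sqlite3 database stands in for a real Evergreen/PostgreSQL connection, and the `bib` table and its columns are invented for illustration; the point is that a bad record is logged and skipped rather than aborting the whole run.]

```python
import sqlite3

def load_records(conn, records):
    """Insert each record in its own transaction; collect failures
    instead of letting one bad record stop the batch."""
    failed = []
    for rec_id, marc in records:
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO bib (id, marc) VALUES (?, ?)",
                    (rec_id, marc))
        except sqlite3.Error:
            failed.append(rec_id)  # log the failure, keep going
    return failed

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bib (id INTEGER PRIMARY KEY, marc TEXT)")
conn.commit()

# The duplicate id 2 violates the primary key and is skipped.
records = [(1, "rec one"), (2, "rec two"), (2, "dup"), (3, "rec three")]
failed = load_records(conn, records)
```

Per-record transactions trade raw throughput for robustness, which fits the point above: meeting the deadline matters more than records-per-minute.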
