Hi Martha,

This is good news. I hope I can accomplish this.
I have been loading files of 1,000 records each. It takes about 35-40 minutes per file, but they don't time out. I do the queue first to get some idea of how many will match and on what, because sometimes I want to overlay and preserve the 856s and sometimes I want to just add the new 856s. Then I start the load. While that load is processing, I do another queue. Still, I was only loading 5 or 6 files a day, so this will definitely speed up the process. Two libraries in our consortium want me to load EBSCO records, 112,000 for each library.

If you select to overlay 1 match and import non-matching, do you get a list of the records that didn't load? Those would be ones with 2 or more matches.

Thanks,
Janet

Janet Schrader
C/W MARS Inc.
Supervisor of Bibliographic Services
67 Millbrook Street, Suite 201
Worcester, MA 01606
tel: 508-755-3323 ext. 25
fax: 508-757-7801
[email protected]

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Martha Driscoll
Sent: Friday, March 14, 2014 4:30 PM
To: [email protected]
Subject: [OPEN-ILS-GENERAL] Marc_stream_importer for batch loading

We have recently come up with a good way to load electronic resource records that I wanted to share.

We have been struggling with how to load our electronic resource MARC records into Evergreen. We constantly receive files from vendors and our cataloger loads them through Vandelay. Sometimes the records match on-file records and just add an 856 link. Other records are new and need to be added.

Vandelay is a great tool because you can set up match criteria and overlay profiles. The only problem is Vandelay will time out with a file of more than 500 records. We have tried splitting the files into 500-record chunks, but the overhead in queuing up the files, especially when you split a 20,000-record file into 40 pieces, can add up.

The solution we have been happy with is an updated version of marc_stream_importer.pl that Bill Erickson recently worked on (LP# 1279998).
Bill added support for overlay 1 match, overlay best match, and import non-matching records. By default marc_stream_importer assumes you have supplied a record ID in a 901 $c. This version now supports all the Vandelay options but can be run from the command line, which also means you can script the loading of records.

Here is how I load a file:

    marc_stream_importer.pl --spoolfile /home/opensrf/file-7 \
        --user xxx --password xxx --source 102 --merge-profile 2 \
        --queue 11391 --auto-overlay-best-match --import-no-match --nodaemon

The record source and merge profile are specified on the command line. The queue contains the record match set. If there are no errors, marc_stream_importer will empty the queue.

I can find the record IDs of records added or updated in the log files:

    #!/usr/bin/perl
    # Print the imported_as=<record ID> field for each log entry from queue 11391.
    @imported = `grep queue=11391 /var/log/evergreen/prod/2014/03/14/activity.log`;
    foreach $line (@imported) {
        # Skip log lines where imported_as is empty.
        if ($line =~ /imported_as= ischanged/) { next; }
        $line =~ s/.*(imported_as=[0-9]+) .*/$1/;
        print $line;
    }

Marc_stream_importer, like Vandelay, still has problems loading more than 500 records at a time. I was getting 'out of shared memory' errors (see LP#1271661). The good news is that files can be easily split using yaz-marcdump and then the commands can be stacked in a shell script.

Here is how to split a file into 500-record files:

    yaz-marcdump -i marc -o marc -s file- -C 500 mybigfile.mrc > /dev/null

Then it's just a matter of creating a shell script to run through the files one at a time, piping the output to a log file so I can verify the records loaded. Over the last 4 nights I was able to load 4 files of 5900 records each.

--
Martha Driscoll
Systems Manager
North of Boston Library Exchange
Danvers, Massachusetts
www.noblenet.org
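
For anyone wanting to try this, here is a minimal sketch of the kind of wrapper script described in the message above. It is not Martha's actual script. The importer options are copied from her example command; the IMPORTER and LOGFILE variables, the load_chunks function name, and the choice to stop at the first chunk that exits with a non-zero status are all assumptions made for illustration. In particular, whether marc_stream_importer.pl reports failures through its exit status has not been checked here, so the log should still be reviewed after each run.

```shell
#!/bin/bash
# Sketch only: load split MARC files one at a time and log each run.
# IMPORTER, LOGFILE and load_chunks are hypothetical names, not part of Evergreen.

IMPORTER="${IMPORTER:-marc_stream_importer.pl}"
LOGFILE="${LOGFILE:-load.log}"

load_chunks() {
    # Each argument is one chunk file produced by the yaz-marcdump split.
    local chunk status
    for chunk in "$@"; do
        echo "=== $(date) loading $chunk" >> "$LOGFILE"
        # Options copied from the example command in the original message.
        "$IMPORTER" --spoolfile "$chunk" \
            --user xxx --password xxx \
            --source 102 --merge-profile 2 --queue 11391 \
            --auto-overlay-best-match --import-no-match --nodaemon \
            >> "$LOGFILE" 2>&1
        status=$?
        # Assumption: a non-zero exit status means the chunk did not load.
        if [ "$status" -ne 0 ]; then
            echo "FAILED: $chunk (exit status $status)" >> "$LOGFILE"
            return 1
        fi
    done
}
```

After sourcing the file, a run over the split files would look like `load_chunks /home/opensrf/file-*`. Stopping at the first failure keeps a problem such as the shared memory error from repeating across every remaining chunk, and the log shows which file to resume from.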
