On Tue, Sep 30, 2008 at 12:31 PM, Andrews, Mark J. <[EMAIL PROTECTED]> wrote:

> Picking up on Dan Wells's note about revising marc2bre, I've been playing
> with an older VMware image of Evergreen (v1.2.1.3 or something like that).
> Why? Because this image was built with 20 GB of (virtual) disk space, plenty
> to handle a copy of Creighton's bibs (669,000) and associated items (1.1
> million). There are other, newer VMware images, but they don't have as much
> space.
>
> Lots of space gives me room to run scripts against, say, a 1 GB input file
> and get ginormous (1, 2, 4, 5, 7 or more GB) output files out the other end.
>
> My problem at the moment is that (I'm guessing the file name, but you'll
> know what I mean) pg_loader_bre.ql (or something like that) contains
> duplicate "bre" records. The target table in PostgreSQL has at least one
> column declared unique, so the import fails. Dan Scott suggested I grep
> around the duplicate records. That's always an option, but I reasoned it
> would be quicker to create a clean export file from the source system and
> then process that clean file on the target side.
>
> I found a way to tell the export program on the source side to put an
> integer into the tag and subfield of my choice. This integer simply numbers
> the bib records in the output file from first to last, giving me a
> guaranteed-unique ID number in the source file. However, I discovered on
> import that some other field is still declared unique, which causes
> PostgreSQL to do what it does and stop the import when it finds a duplicate
> key. Hmmm, what to do?
>
> I suggest the processing script (somehow) identify duplicate records, write
> them to an exception file, and skip to the next record. This is potentially
> difficult because the import scripts, several *.sql files, contain records
> related to the *.bre records. So a duplicate *.bre record would need to be
> skipped along with any related records in the other files. I wonder how to
> do this?
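The skip-and-divert idea Mark describes can be sketched in a few lines. This is a hypothetical illustration, not the actual Evergreen loader format: records are modeled as simple (internal_id, tcn) tuples, and related rows in companion files are assumed to reference the record's internal ID.

```python
# Rough sketch of dedupe-and-skip: keep the first record per key, divert
# duplicates to an exception list, then drop related rows whose parent
# record was skipped. Record shapes here are illustrative assumptions.

def split_records(records, key_of):
    """Return (kept, duplicates); the first record for each key wins."""
    seen = set()
    kept, dupes = [], []
    for rec in records:
        key = key_of(rec)
        if key in seen:
            dupes.append(rec)   # candidate rows for an exception file
        else:
            seen.add(key)
            kept.append(rec)
    return kept, dupes

def drop_orphans(related, parent_of, skipped_ids):
    """Filter companion-file rows that point at a skipped record."""
    return [row for row in related if parent_of(row) not in skipped_ids]

# Toy data: (internal_id, tcn); record 3 duplicates TCN "A".
bre = [(1, "A"), (2, "B"), (3, "A")]
kept, dupes = split_records(bre, key_of=lambda r: r[1])
skipped_ids = {r[0] for r in dupes}

# Related rows (e.g. call numbers) reference the bre internal id.
call_numbers = [(10, 1), (11, 3), (12, 2)]
clean = drop_orphans(call_numbers, parent_of=lambda r: r[1],
                     skipped_ids=skipped_ids)
print(kept)   # [(1, 'A'), (2, 'B')]
print(clean)  # [(10, 1), (12, 2)]
```

The key point is that the exception pass has to run first, so the set of skipped internal IDs is known before the related *.sql files are filtered.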
There are command-line options meant to help with that: they allow you to supply a file containing the list of TCNs that are already spoken for, and to offset the start of a particular file's IDs by a set amount. But since I just committed Dan's new version of marc2bre.pl, I recommend you grab a copy of that. It's more explicit in its options, generally cleaner, and all-around better than the old (serviceable, but cantankerous) version that has evolved over the last three years or so. The copy in Dan's recent email is the same as what was committed to trunk, minus a few formatting changes (more of which are needed ... it was decided to move to space-indenting instead of tabs, and many files are still suffering).

Take a look inside the new script at the option comments. I believe a combination of idfield/idsubfield and tcnfield/tcnsubfield should be all you need, assuming the TCNs are unique in the source system. You should be able to use the same value for idfield and tcnfield.

--
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone: 1-877-OPEN-ILS (673-6457)
 | email: [EMAIL PROTECTED]
 | web: http://www.esilibrary.com
