Kyle -- if this were me, I'd break the file into a database. You have a lot
of different options, but the last time I had to do something like this, I
broke the data into 10 tables: a control table with a primary key and OCLC
number, plus a table for 0xx fields, a table for 1xx, 2xx, etc., each
including the OCLC number and the key they relate to. You can actually do
this with MarcEdit (if you have MySQL installed), but on a laptop I'm not
going to guarantee speed with the process. Plus, the time to generate the
SQL data will be significant. It might take 15 hours to build the database,
but then you'd have it and could create indexes on it. You could use it to
create the database and then prep the files for later work.
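The table layout Terry describes could be sketched like this -- a minimal illustration using Python's built-in sqlite3 rather than MySQL so it runs anywhere; the table and column names are made up for the example and are not MarcEdit's actual schema:

```python
import sqlite3

# Illustrative sketch: a control table keyed by record id / OCLC number,
# plus one table per block of MARC tags (0xx, 1xx, ... 9xx), each carrying
# the OCLC number and the control-table key it relates to.
conn = sqlite3.connect(":memory:")  # use a file path for real data
cur = conn.cursor()

cur.execute("""CREATE TABLE control (
    rec_id INTEGER PRIMARY KEY,
    oclc_number TEXT)""")

for block in range(10):
    cur.execute(f"""CREATE TABLE fields_{block}xx (
        rec_id INTEGER REFERENCES control(rec_id),
        oclc_number TEXT,
        tag TEXT,
        value TEXT)""")

# After loading the data, index the join keys so lookups are fast.
cur.execute("CREATE INDEX idx_control_oclc ON control(oclc_number)")
conn.commit()
```

Generating the inserts from 14GB of MARC is where the hours go; the schema itself is cheap, and the indexes can be built after the load.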

--TR

-----Original Message-----
From: Code for Libraries [mailto:[email protected]] On Behalf Of Kyle 
Banerjee
Sent: Wednesday, February 27, 2013 9:45 AM
To: [email protected]
Subject: [CODE4LIB] Slicing/dicing/combining large amounts of data efficiently

I'm involved in a migration project that requires identification of local 
information in millions of MARC records.

The master records I need to compare with are 14GB total. I don't know what the 
others will be, but since the masters are deduped and the source files aren't 
(plus they contain loads of other garbage), there will be considerably more. 
Roughly speaking, if I compare 1,000 master records per second, it would take 
about 2 1/2 hours to cut through the file. I need to be able to ask the file 
whatever questions the librarians might have (i.e., many), so speed is 
important.

For reasons I won't go into right now, I'm stuck doing this on my laptop in 
Cygwin, which limits my range of motion.

I'm trying to figure out the best way to proceed. Currently, I'm extracting 
specific fields for comparison. Each field tag gets a single line keyed by OCLC 
number (repeated fields are catted together with a delimiter). The idea is that 
if I deal with only one field at a time, I can slurp the master info into memory 
and retrieve it via hash (OCLC control number) as I loop through the comparison 
data. Local data will either be stored in special files that are loaded 
separately from the bibs or recorded in reports for maintenance projects.
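The slurp-and-hash step described above might look roughly like this -- a sketch only; the file layout (one line per record per field tag, OCLC number as key, repeated fields joined with a delimiter) follows the description, but the tab separator, filenames, and function names are assumptions:

```python
# Hash-join sketch: load the master field file into a dict, then stream
# the (larger, undeduped) comparison file past it. Assumes lines of the
# form "OCLCNUM<TAB>value|value|..." -- the delimiter is hypothetical.

def load_master(path):
    """Slurp the master field file into a dict keyed by OCLC number."""
    master = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            oclc, _, value = line.rstrip("\n").partition("\t")
            master[oclc] = value
    return master

def compare(master, comparison_path):
    """Yield (oclc, master_value, local_value) where the field differs."""
    with open(comparison_path, encoding="utf-8") as f:
        for line in f:
            oclc, _, value = line.rstrip("\n").partition("\t")
            if oclc in master and master[oclc] != value:
                yield oclc, master[oclc], value
```

Since only one field's worth of data is held in memory at a time, the dict stays manageable even when the full records wouldn't fit.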

This process is clunky because a special comparison file has to be created for 
each question, but it does seem to work (generating the preprocess files and 
then doing the compare takes minutes rather than hours). I didn't use a DB 
because there's no way I could store the reference data in memory, and I 
figured I'd just thrash my drive.

Is this a reasonable approach, and whether or not it is, what tools should I be 
thinking of using for this? Thanks,

kyle
