I'd also consider using a document database (e.g. MongoDB) with the MARC-in-JSON format for this. You could run jsonpath-style queries or map/reduce to get your answers.
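For instance -- a minimal sketch in the mongo shell, assuming you've loaded each record as a MARC-in-JSON document into a "bibs" collection (the collection name and the example 035 value here are made up):

    // MARC-in-JSON stores fields as an array of single-key objects, e.g.
    //   { "fields": [ { "001": "12345" },
    //                 { "035": { "ind1": " ", "ind2": " ",
    //                            "subfields": [ { "a": "(OCoLC)12345" } ] } } ] }
    // Mongo's dot notation walks arrays, so you can match tag/subfield paths:
    db.bibs.find({ "fields.035.subfields.a": "(OCoLC)12345" })

    // Index that path first, or every query scans the whole collection:
    db.bibs.ensureIndex({ "fields.035.subfields.a": 1 })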
Mongo runs best in memory, but I think you'll be fine since you don't need immediate answers.

-Ross

On Wednesday, February 27, 2013, Andy Kohler wrote:

> I agree with Terry: use a database. Since you're doing multiple queries,
> invest the time up front to import your data in a queryable format, with
> indexes, instead of repeatedly building comparison files.
>
> But of course, it depends... dealing with large amounts of data
> efficiently is often best done with lots of memory. But if you can run
> MySQL, and the lengthy up-front parsing/loading/indexing of the records
> is acceptable, go for it.
>
> For what it's worth, I have done something similar for many years: I
> build a database with all of our MARC records, parsed down to the
> subfield level. It's great for queries like "find me all the records
> with XYZ in one subfield and ABC in another" or "find all of the
> duplicate OCLC numbers". It's not so great if you need to output the
> original field in a report (though a field can be rebuilt from its
> subfields).
>
> Here's the Oracle table I use:
>
> CREATE TABLE bib_subfield
> ( record_id    INT             NOT NULL
> , field_seq    INT             NOT NULL
> , subfield_seq INT             NOT NULL
> , indicators   CHAR(2)         NULL
> , tag          CHAR(4)         NOT NULL
> , subfield     NVARCHAR2(4000) NULL
> );
>
> Our MARC data is Unicode, thus the NVARCHAR2. Super-long subfields, like
> some 5xx notes, do get truncated, but that's a tiny fraction of a
> percent of the data lost -- a fair tradeoff for our needs.
>
> field_seq and subfield_seq track the ordinal position of each field
> within the record, and of each subfield within a field, for those
> occasional queries wanting data from the "first" 650 field, or subfields
> which aren't in the "correct" order per catalogers. You may not need
> that level of detail.
>
> Another, completely unrelated, possible solution depending on your
> needs: run the records through solrmarc and do your queries via Solr.
>
> Good luck... let us know what you eventually decide to do.
>
> --Andy
>
> On Wed, Feb 27, 2013 at 9:53 AM, Reese, Terry
> <terry.re...@oregonstate.edu> wrote:
>
> > Kyle -- if this was me, I'd break the file into a database. You have a
> > lot of different options, but the last time I had to do something like
> > this, I broke the data into 10 tables: a control table with a primary
> > key and OCLC number, plus a table for 0xx fields, a table for 1xx, one
> > for 2xx, etc., each including the OCLC number and the key they relate
> > to. You can actually do this with MarcEdit (if you have MySQL
> > installed), but on a laptop I'm not going to guarantee the speed of
> > the process. Plus, the time to generate the SQL data will be
> > significant. It might take 15 hours to generate the database, but then
> > you'd have it and could create indexes on it -- build the database
> > once, then prep the files for later work.
> >
> > --TR
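To make Terry's layout concrete -- a rough sketch of the control table plus one of the per-tag-block tables. The table and column names and types here are invented for illustration; they're not what MarcEdit actually generates:

    -- Hypothetical sketch of the 10-table split Terry describes.
    CREATE TABLE bib_control
    ( rec_key  INT PRIMARY KEY
    , oclc_num VARCHAR(20)
    );

    CREATE TABLE bib_1xx          -- likewise bib_0xx, bib_2xx, etc.
    ( rec_key  INT                -- relates back to bib_control.rec_key
    , oclc_num VARCHAR(20)
    , tag      CHAR(3)
    , field    VARCHAR(4000)      -- the full field content
    );

    CREATE INDEX ix_bib_1xx_key ON bib_1xx (rec_key);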
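And a sketch of what the queries Andy mentions might look like against his bib_subfield table, assuming the 4-character tag column holds the tag plus subfield code (e.g. '035a') and that the OCLC number sits in 035 $a -- both of those readings, and the tags used below, are guesses for illustration:

    -- Duplicate OCLC numbers:
    SELECT subfield AS oclc_number, COUNT(*) AS occurrences
    FROM   bib_subfield
    WHERE  tag = '035a'
    GROUP  BY subfield
    HAVING COUNT(*) > 1;

    -- "XYZ in one subfield and ABC in another", as a self-join on record_id:
    SELECT DISTINCT a.record_id
    FROM   bib_subfield a
    JOIN   bib_subfield b ON b.record_id = a.record_id
    WHERE  a.tag = '500a' AND a.subfield LIKE '%XYZ%'
    AND    b.tag = '710a' AND b.subfield LIKE '%ABC%';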