[Since you're getting good performance using a relational database, these
tricks may not be necessary for you, but I've been looking at some of the
tricks I've used in my own code to see how they can be fitted into the
revived marc4j project, so I thought I'd write them down]
On Wed, Feb 27, 2013 at 1:42 PM, Andy Kohler akoh...@ucla.edu wrote:
I agree with Terry: use a database. Since you're doing multiple queries,
invest the time up front to import your data into a queryable format, with
indexes, instead of repeatedly building comparison files. But of course, it
depends... dealing with large amounts of data efficiently is often best ...
Sent: Wednesday, February 27, 2013 9:45 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Slicing/dicing/combining large amounts of data efficiently

I'm involved in a migration project that requires identification of local
information in millions of MARC records.
The master records I need to compare with are 14GB total. I don't know what
the others will be, but since the masters are deduped and the source files
aren't (plus they contain loads ...
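[As a rough sketch of the import-once-then-query approach Andy describes,
assuming pymarc and SQLite; the file names, the table layout, and the choice
of the 001 control number as the match key are placeholders, not anything
from the thread:]

import sqlite3
from pymarc import MARCReader

conn = sqlite3.connect("masters.db")
conn.execute("CREATE TABLE IF NOT EXISTS masters (control_no TEXT, title TEXT)")

# Parse the master records once, storing just the fields you need to compare on.
with open("master_records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        if record is None:   # skip records pymarc could not parse
            continue
        control_no = record["001"].value() if record["001"] else None
        f245 = record.get_fields("245")
        subs = f245[0].get_subfields("a") if f245 else []
        title = subs[0] if subs else None
        conn.execute("INSERT INTO masters (control_no, title) VALUES (?, ?)",
                     (control_no, title))

# Build the index once, up front; every later comparison becomes an indexed
# lookup instead of another pass over 14GB of raw MARC.
conn.execute("CREATE INDEX IF NOT EXISTS idx_control ON masters (control_no)")
conn.commit()

# Then, for each record in a (non-deduped) source file:
cur = conn.execute("SELECT title FROM masters WHERE control_no = ?",
                   ("ocm12345678",))
print(cur.fetchone())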
I'd also consider using a document db (e.g. MongoDB) with the marc-in-JSON
format for this.
You could run jsonpath queries or map/reduce to get your answers.
Mongo runs best in memory, but I think you'll be fine since you don't need
immediate answers.
-Ross.
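[And a similarly rough sketch of the marc-in-JSON / MongoDB route Ross
suggests, assuming pymarc and pymongo; the database and collection names and
the use of 590 as the "local note" field are placeholders:]

import json
from pymarc import MARCReader
from pymongo import MongoClient

coll = MongoClient()["migration"]["masters"]

with open("master_records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        if record is None:   # skip records pymarc could not parse
            continue
        # pymarc's as_json() emits the marc-in-JSON structure:
        # {"leader": "...", "fields": [{"001": "..."}, {"245": {...}}, ...]}
        coll.insert_one(json.loads(record.as_json()))

# Index the control number so per-record lookups stay cheap.
coll.create_index("fields.001")

# "fields" is an array of single-key objects, so a dotted path matches any
# element: e.g. count every record carrying a local 590 note.
print(coll.count_documents({"fields.590": {"$exists": True}}))

# Or pull one record by control number.
print(coll.find_one({"fields.001": "ocm12345678"}))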