Re: [CODE4LIB] Slicing/dicing/combining large amounts of data efficiently

2013-03-06 Thread Simon Spero
[Since you're getting good performance using a relational database, these may not be necessary for you, but I've been looking at some of the tricks I've used in my own code to see how they can be fitted into the revived marc4j project, so I thought I'd write them down] If the Tennant/Dean

Re: [CODE4LIB] Slicing/dicing/combining large amounts of data efficiently

2013-03-04 Thread Kyle Banerjee
On Wed, Feb 27, 2013 at 1:42 PM, Andy Kohler akoh...@ucla.edu wrote: I agree with Terry: use a database. Since you're doing multiple queries, invest the time up front to import your data in a queryable format, with indexes, instead of repeatedly building comparison files... Another,

[CODE4LIB] Slicing/dicing/combining large amounts of data efficiently

2013-02-27 Thread Kyle Banerjee
I'm involved in a migration project that requires identification of local information in millions of MARC records. The master records I need to compare with are 14GB total. I don't know what the others will be, but since the masters are deduped and the source files aren't (plus they contain loads
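A minimal sketch of the kind of comparison pass being described, assuming pymarc and the 001 control number as the match key; the thread specifies neither, so treat both as illustrative:

# Sketch: build a set of match keys from the deduped master file once,
# then stream each source file against it. Assumes pymarc and the 001
# control number as the key -- both assumptions, not from the thread.
from pymarc import MARCReader

def control_numbers(path):
    """Yield each record's 001 control number from a binary MARC file."""
    with open(path, 'rb') as fh:
        for record in MARCReader(fh):
            if record is None:        # skip records pymarc can't parse
                continue
            field = record['001']
            if field is not None:
                yield field.data.strip()

# Millions of short keys fit comfortably in memory even when the
# records themselves total 14GB.
masters = set(control_numbers('masters.mrc'))

with open('source.mrc', 'rb') as fh:
    for record in MARCReader(fh):
        if record and record['001'] and record['001'].data.strip() in masters:
            pass  # candidate match: inspect this record for local fields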

Re: [CODE4LIB] Slicing/dicing/combining large amounts of data efficiently

2013-02-27 Thread Reese, Terry
[quoting the original post of Wed, Feb 27, 2013 9:45 AM, To: CODE4LIB@LISTSERV.ND.EDU, Subject: [CODE4LIB] Slicing/dicing/combining large amounts of data efficiently] I'm involved in a migration project that requires identification of local information in millions of MARC records. The master records I need to compare with are 14GB total

Re: [CODE4LIB] Slicing/dicing/combining large amounts of data efficiently

2013-02-27 Thread Andy Kohler
I agree with Terry: use a database. Since you're doing multiple queries, invest the time up front to import your data in a queryable format, with indexes, instead of repeatedly building comparison files. But of course, it depends... dealing with large amounts of data efficiently is often best
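What Andy describes might look like the following with SQLite; the engine, table, and column names are my assumptions, not anything named in the thread:

# Sketch of the load-once, query-many approach, assuming SQLite;
# table and column names are invented for the example.
import sqlite3

conn = sqlite3.connect('marc_compare.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS records (
        match_key TEXT,   -- e.g. a control number or OCLC number
        source    TEXT,   -- which input file the record came from
        raw       BLOB    -- the full record, for later inspection
    )
""")

def load(rows):
    """Bulk-insert an iterable of (match_key, source, raw) tuples."""
    conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    conn.commit()

# Index after the bulk load: building the index once is generally
# cheaper than maintaining it across millions of inserts.
conn.execute("CREATE INDEX IF NOT EXISTS idx_key ON records (match_key)")

# Comparisons become single queries instead of repeated file passes:
dupes = conn.execute("""
    SELECT match_key, COUNT(*) FROM records
    GROUP BY match_key HAVING COUNT(*) > 1
""").fetchall()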

Re: [CODE4LIB] Slicing/dicing/combining large amounts of data efficiently

2013-02-27 Thread Ross Singer
I'd also consider using a document db (e.g. MongoDB) with the marc-in-JSON format for this. You could run jsonpath queries or map/reduce to get your answers. Mongo runs best in memory, but I think you'll be fine since you don't need immediate answers. -Ross. On Wednesday, February 27, 2013,
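A rough sketch of Ross's suggestion, assuming pymongo, a local mongod, and the marc-in-JSON layout where "fields" is a list of {tag: value} objects; database and collection names are made up, and I've used MongoDB's aggregation pipeline in place of the map/reduce he mentions (same idea, server-side grouping):

# Sketch: load marc-in-JSON documents into MongoDB and query them.
# Assumes pymongo and one JSON document per line in the input file --
# all assumptions for the example.
import json
from pymongo import MongoClient, ASCENDING

coll = MongoClient()['migration']['masters']

with open('masters.json') as fh:
    coll.insert_many(json.loads(line) for line in fh)

# In marc-in-JSON, "fields" is a list of {tag: value} objects, so the
# dotted path below matches any element carrying an 001 key. Indexing
# it keeps lookups from scanning the whole collection.
coll.create_index([('fields.001', ASCENDING)])

# Look up a master record by a control number taken from a source file:
match = coll.find_one({'fields.001': 'ocm12345678'})

# Server-side grouping: count how many documents share each control
# number (duplicates across the loaded files).
dupes = coll.aggregate([
    {'$unwind': '$fields'},
    {'$match': {'fields.001': {'$exists': True}}},
    {'$group': {'_id': '$fields.001', 'n': {'$sum': 1}}},
    {'$match': {'n': {'$gt': 1}}},
])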