I'd also consider using a document database (e.g. MongoDB) with the MARC-in-JSON format for this. You could run jsonpath-style queries or map/reduce to get your answers.
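For instance -- a minimal sketch in the mongo shell, assuming you've loaded each record as a MARC-in-JSON document into a "bibs" collection (the collection name and the example 035 value here are made up):

    // MARC-in-JSON stores fields as an array of single-key objects, e.g.
    //   { "fields": [ { "001": "12345" },
    //                 { "035": { "ind1": " ", "ind2": " ",
    //                            "subfields": [ { "a": "(OCoLC)12345" } ] } } ] }
    // Mongo's dot notation walks arrays, so you can match tag/subfield paths:
    db.bibs.find({ "fields.035.subfields.a": "(OCoLC)12345" })

    // Index that path first, or every query scans the whole collection:
    db.bibs.ensureIndex({ "fields.035.subfields.a": 1 })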
Mongo runs best in memory, but I think you'll be fine since you don't need immediate answers.

-Ross

On Wednesday, February 27, 2013, Andy Kohler wrote:

> I agree with Terry: use a database. Since you're doing multiple queries,
> invest the time up front to import your data in a queryable format, with
> indexes, instead of repeatedly building comparison files.
>
> But of course, it depends... dealing with large amounts of data
> efficiently is often best done with lots of memory. But if you can run
> MySQL, and the lengthy up-front parsing/loading/indexing of the records
> is acceptable, go for it.
>
> For what it's worth, I have done something similar for many years: I
> build a database with all of our MARC records, parsed down to the
> subfield level. It's great for queries like "find me all the records
> with XYZ in one subfield and ABC in another" or "find all of the
> duplicate OCLC numbers". It's not so great if you need to output the
> original field in a report (though a field can be rebuilt from its
> subfields).
>
> Here's the Oracle table I use:
>
> CREATE TABLE bib_subfield
> ( record_id    INT             NOT NULL
> , field_seq    INT             NOT NULL
> , subfield_seq INT             NOT NULL
> , indicators   CHAR(2)         NULL
> , tag          CHAR(4)         NOT NULL
> , subfield     NVARCHAR2(4000) NULL
> );
>
> Our MARC data is Unicode, thus the NVARCHAR2. Super-long subfields, like
> some 5xx notes, do get truncated, but that's a tiny fraction of a
> percent of the data lost -- a fair tradeoff for our needs.
>
> field_seq and subfield_seq track the ordinal position of each field
> within the record, and of each subfield within a field, for those
> occasional queries wanting data from the "first" 650 field, or subfields
> which aren't in the "correct" order per catalogers. You may not need
> that level of detail.
>
> Another, completely unrelated, possible solution depending on your
> needs: run the records through solrmarc and do your queries via Solr.
>
> Good luck... let us know what you eventually decide to do.
>
> --Andy
>
> On Wed, Feb 27, 2013 at 9:53 AM, Reese, Terry
> <terry.re...@oregonstate.edu> wrote:
>
> > Kyle -- if this was me, I'd break the file into a database. You have a
> > lot of different options, but the last time I had to do something like
> > this, I broke the data into 10 tables: a control table with a primary
> > key and OCLC number, plus a table for 0xx fields, a table for 1xx, one
> > for 2xx, etc., each including the OCLC number and the key they relate
> > to. You can actually do this with MarcEdit (if you have MySQL
> > installed), but on a laptop I'm not going to guarantee the speed of
> > the process. Plus, the time to generate the SQL data will be
> > significant. It might take 15 hours to generate the database, but then
> > you'd have it and could create indexes on it -- build the database
> > once, then prep the files for later work.
> >
> > --TR
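To make Terry's layout concrete -- a rough sketch of the control table plus one of the per-tag-block tables. The table and column names and types here are invented for illustration; they're not what MarcEdit actually generates:

    -- Hypothetical sketch of the 10-table split Terry describes.
    CREATE TABLE bib_control
    ( rec_key  INT PRIMARY KEY
    , oclc_num VARCHAR(20)
    );

    CREATE TABLE bib_1xx          -- likewise bib_0xx, bib_2xx, etc.
    ( rec_key  INT                -- relates back to bib_control.rec_key
    , oclc_num VARCHAR(20)
    , tag      CHAR(3)
    , field    VARCHAR(4000)      -- the full field content
    );

    CREATE INDEX ix_bib_1xx_key ON bib_1xx (rec_key);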
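And a sketch of what the queries Andy mentions might look like against his bib_subfield table, assuming the 4-character tag column holds the tag plus subfield code (e.g. '035a') and that the OCLC number sits in 035 $a -- both of those readings, and the tags used below, are guesses for illustration:

    -- Duplicate OCLC numbers:
    SELECT subfield AS oclc_number, COUNT(*) AS occurrences
    FROM   bib_subfield
    WHERE  tag = '035a'
    GROUP  BY subfield
    HAVING COUNT(*) > 1;

    -- "XYZ in one subfield and ABC in another", as a self-join on record_id:
    SELECT DISTINCT a.record_id
    FROM   bib_subfield a
    JOIN   bib_subfield b ON b.record_id = a.record_id
    WHERE  a.tag = '500a' AND a.subfield LIKE '%XYZ%'
    AND    b.tag = '710a' AND b.subfield LIKE '%ABC%';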