Hm. You don't need to keep all 800k records in memory -- just the data you need, right? I'd keep a hash keyed by authorized heading, with the values I need as the entries.
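For instance, a rough sketch of the counting pass (the headings here are made up, and the commented-out pymarc read loop is just how I'd expect it to look -- check it against the pymarc docs before relying on it):

```python
from collections import Counter

# With pymarc the read loop would look something like:
#   from pymarc import MARCReader
#   with open('innz.mrc', 'rb') as fh:
#       for record in MARCReader(fh):
#           for field in record.get_fields('100', '700'):
#               counts[field.format_field()] += 1
# Here the extracted headings are simulated to show the tally itself.

def tally(headings):
    """Return a Counter mapping authorized heading -> reference count."""
    return Counter(headings)

headings = [
    'Mansfield, Katherine, 1888-1923.',
    'Frame, Janet.',
    'Mansfield, Katherine, 1888-1923.',
]
counts = tally(headings)

# most_common() gives you the "sorted by how many times they're
# referenced" ordering for free.
for heading, n in counts.most_common():
    print(f'{n}\t{heading}')
```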

I don't think you'll have trouble keeping such a hash in memory for a batch process run manually once in a while. Modern OSes do a great job with virtual memory, making it invisible (but slower) when you use more memory than you physically have -- if it even comes to that, which it may not.

If you do, you could keep the data you need in the data store of your choice, such as a local DBM database. Ruby, Python, and Perl will all let you do that pretty painlessly: you get a hash-like data structure that is actually stored on disk rather than in memory, but which you access more or less the same as an in-memory hash.
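In Python that's the stdlib `shelve` module (a dict-like wrapper over DBM). A minimal sketch, with made-up headings and a throwaway temp-file path:

```python
import os
import shelve
import tempfile

# A shelve database behaves like a dict but is stored on disk, so the
# working set never has to fit in RAM. The path here is just for demo.
path = os.path.join(tempfile.mkdtemp(), 'headings')

with shelve.open(path) as db:
    for heading in ['Mansfield, Katherine, 1888-1923.',
                    'Frame, Janet.',
                    'Mansfield, Katherine, 1888-1923.']:
        db[heading] = db.get(heading, 0) + 1

# Reopen later -- the counts are still there, read back from disk.
with shelve.open(path) as db:
    print(db['Mansfield, Katherine, 1888-1923.'])  # 2
```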

But, yes, it will require some programming, for sure.
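The wikimedia-syntax output for (a) is then just string formatting over the tally. A sketch with hypothetical counts (the exact wikitext you want is up to you):

```python
# Hypothetical tallies; in practice these come from the counting pass.
counts = {
    'Mansfield, Katherine, 1888-1923.': 2,
    'Frame, Janet.': 1,
}

# One MediaWiki bullet per heading, most-referenced first.
lines = ['* {} ({})'.format(heading, n)
         for heading, n in sorted(counts.items(), key=lambda kv: -kv[1])]
wikitext = '\n'.join(lines)
print(wikitext)
```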

A "MARC indexer" can mean many things, and I'm not sure you need one here. But as it happens, I have built something you could describe as a MARC indexer, and I'll grant it wasn't exactly straightforward. I'm not sure it's of any use for your use case, but you can check it out at https://github.com/traject-project/traject

On 11/2/14 9:29 PM, Stuart Yeates wrote:
Do any of these have built-in indexing? 800k records isn't going to
fit in memory and if building my own MARC indexer is 'relatively
straightforward' then you're a better coder than I am.

cheers stuart

-- I have a new phone number: 04 463 5692

________________________________________
From: Code for Libraries <CODE4LIB@LISTSERV.ND.EDU> on behalf of Jonathan Rochkind <rochk...@jhu.edu>
Sent: Monday, 3 November 2014 1:24 p.m.
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC reporting engine

If you are, can become, or know, a programmer, that would be
relatively straightforward in any programming language using the open
source MARC processing library for that language. (ruby marc, pymarc,
perl marc, whatever).

Although you might find more trouble than you expect around
authorities, with them being less standardized in your corpus than
you might like.

________________________________________
From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Stuart Yeates [stuart.yea...@vuw.ac.nz]
Sent: Sunday, November 02, 2014 5:48 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] MARC reporting engine

I have ~800,000 MARC records from an indexing service
(http://natlib.govt.nz/about-us/open-data/innz-metadata CC-BY). I am
trying to generate:

(a) a list of person authorities (and sundry metadata), sorted by how
many times they're referenced, in wikimedia syntax

(b) a view of a person authority, with all the records by which
they're referenced, processed into a wikipedia stub biography

I have established that this is too much data to process in XSLT or
multi-line regexps in vi. What other MARC engines are there out
there?

The two options I'm aware of are learning multi-line processing in
sed or learning enough koha to write reports in whatever their
reporting engine is.

Any advice?

cheers stuart
--
I have a new phone number: 04 463 5692

