I was going to echo Eric Hatcher's recommendation of Solr and SolrMarc,
since I'm the creator of SolrMarc.
It does provide many of the same tools as are described in the toolset
you linked to, but it is designed to write to Solr rather than to a SQL
style database. Solr may or may not be more suitable for your needs
than a SQL database. However, I decided to download the data to see
whether SolrMarc could handle it. I started with the MARCXML.gz data,
ungzipped it to get a .XML file, but the resulting file causes SolrMarc
to blow chunks. Either I'm missing something or there is something way
wrong with that data. The gzipped binary MARC file works fine with the
SolrMarc tools.
Creating a SolrMarc script to extract the 700 fields, plus a bash script
to cluster and count them and sort them by frequency, took about 20 minutes.
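
For readers who want to see the shape of that extract, cluster, and count step without SolrMarc, here is a minimal sketch in standard-library Python. It is not the SolrMarc script or the bash script mentioned above. It assumes well-formed binary MARC (ISO 2709) input, takes the personal name from subfield $a of each 700 field, and "clusters" only by trimming trailing spaces and commas. The function names (iter_records, fields_with_tag, subfield, count_700) are made up for illustration.

```python
from collections import Counter

FIELD_END = b"\x1e"   # ISO 2709 field terminator
SUBFIELD = b"\x1f"    # ISO 2709 subfield delimiter


def iter_records(stream):
    """Yield raw records one at a time from a binary MARC file object."""
    while True:
        head = stream.read(5)             # leader bytes 0-4: record length
        if len(head) < 5:
            return
        length = int(head)
        yield head + stream.read(length - 5)


def fields_with_tag(record, tag):
    """Yield the raw bytes of every field in `record` carrying `tag`."""
    base = int(record[12:17])             # leader bytes 12-16: base address of data
    directory = record[24:base - 1]       # 12-byte entries: tag(3) length(4) start(5)
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        if entry[:3] == tag:
            length, start = int(entry[3:7]), int(entry[7:12])
            # Slice out the field and drop its trailing terminator.
            yield record[base + start:base + start + length - 1]


def subfield(field, code):
    """Return the first subfield with the given one-byte code, or None."""
    for chunk in field.split(SUBFIELD)[1:]:
        if chunk[:1] == code:
            return chunk[1:].decode("utf-8", errors="replace")
    return None


def count_700(stream):
    """Count 700 $a headings and return (name, count) pairs, most frequent first."""
    counts = Counter()
    for record in iter_records(stream):
        for field in fields_with_tag(record, b"700"):
            name = subfield(field, b"a")
            if name:
                # Assumption: a very crude normalisation stands in for clustering.
                counts[name.rstrip(" ,")] += 1
    return counts.most_common()
```

Because records are read one at a time, only the name counts are held in memory. A gzipped file could be passed in as gzip.open("records.mrc.gz", "rb"), where the filename is a placeholder.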
-Bob Haschart
On 11/3/2014 3:00 PM, Stuart Yeates wrote:
Thank you to all who responded with software suggestions.
https://github.com/ubleipzig/marctools is looking like the most promising
candidate so far. The more I read through the recommendations the more it
dawned on me that I don't want to have to configure yet another Java toolchain
(yes I know, that may be personal bias).
Thank you to all who responded about the challenges of authority control in
such collections. I'm aware of these issues. The current project is about
marshalling resources so that editors can make informed decisions, rather than
automating the creation of articles. Because there is human judgement involved
in the last step, I can afford to take a few authority control 'risks'.
cheers
stuart
________________________________________
From: Code for Libraries <CODE4LIB@LISTSERV.ND.EDU> on behalf of raffaele
messuti <raffaele.mess...@gmail.com>
Sent: Monday, 3 November 2014 11:39 p.m.
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC reporting engine
Stuart Yeates wrote:
Do any of these have built-in indexing? 800k records isn't going to fit in
memory and if building my own MARC indexer is 'relatively straightforward' then
you're a better coder than I am.
You could try marcdb[1] from marctools[2].
[1] https://github.com/ubleipzig/marctools#marcdb
[2] https://github.com/ubleipzig/marctools
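
To illustrate the general idea of indexing a large MARC file rather than holding it in memory, here is a rough sketch using Python's standard sqlite3 module. It is not marcdb's implementation or schema; the table layout and the function names (build_offset_index, control_number, fetch) are assumptions made for this example. It stores the byte offset and length of each record keyed by its 001 control number, so a single record can be read back with one seek.

```python
import sqlite3


def control_number(record):
    """Return the 001 control number of a raw ISO 2709 record, or None."""
    base = int(record[12:17])             # leader bytes 12-16: base address of data
    directory = record[24:base - 1]       # 12-byte entries: tag(3) length(4) start(5)
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        if entry[:3] == b"001":
            length, start = int(entry[3:7]), int(entry[7:12])
            raw = record[base + start:base + start + length - 1]
            return raw.decode("utf-8", errors="replace")
    return None


def build_offset_index(stream, db_path=":memory:"):
    """Scan a binary MARC file object once and index every record's position."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS records "
        "(id TEXT, offset INTEGER, length INTEGER)"
    )
    offset = 0
    while True:
        head = stream.read(5)             # leader bytes 0-4: record length
        if len(head) < 5:
            break
        length = int(head)
        record = head + stream.read(length - 5)
        db.execute(
            "INSERT INTO records VALUES (?, ?, ?)",
            (control_number(record), offset, length),
        )
        offset += length
    db.execute("CREATE INDEX IF NOT EXISTS records_id ON records (id)")
    db.commit()
    return db


def fetch(stream, db, record_id):
    """Read one raw record back from the file using the stored offset."""
    row = db.execute(
        "SELECT offset, length FROM records WHERE id = ?", (record_id,)
    ).fetchone()
    if row is None:
        return None
    stream.seek(row[0])
    return stream.read(row[1])
```

Passing a file path as db_path keeps the index on disk between runs. Seeking is cheap on an uncompressed file; on a gzip stream it would be slow, so this approach suits the ungzipped binary MARC file.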
--
raffaele