I was going to echo Eric Hatcher's recommendation of Solr and SolrMarc, since I'm the creator of SolrMarc. It provides many of the same tools as the toolset you linked to, but it is designed to write to Solr rather than to a SQL-style database. Solr may or may not be more suitable for your needs than a SQL database. However, I decided to download the data to see whether SolrMarc could handle it. I started with the MARCXML.gz data and ungzipped it to get a .XML file, but the resulting file causes SolrMarc to blow chunks. Either I'm missing something or there is something way wrong with that data. The gzipped binary MARC file works fine with the SolrMarc tools.

Creating a SolrMarc script to extract the 700 fields, plus a bash script to cluster, count, and sort them by frequency, took about 20 minutes.
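
For anyone who wants to see the cluster-and-count step spelled out, here is a minimal sketch in Python rather than bash. It is not the script described above. It assumes the 700 field values have already been extracted to plain text, one value per line, and the function name count_names is made up for illustration.

```python
from collections import Counter


def count_names(lines):
    """Group identical 700-field values and return (value, count) pairs,
    most frequent first. Ties are broken alphabetically."""
    counts = Counter()
    for line in lines:
        # Collapse runs of whitespace so trivially different spacings cluster together.
        name = " ".join(line.split())
        if name:  # skip blank lines
            counts[name] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Passing an open file (or sys.stdin) as lines works, since the function only iterates over it once. Real authority clustering would need more normalisation than whitespace; this only groups exact matches.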

-Bob Haschart


On 11/3/2014 3:00 PM, Stuart Yeates wrote:
Thank you to all who responded with software suggestions. 
https://github.com/ubleipzig/marctools is looking like the most promising 
candidate so far. The more I read through the recommendations the more it 
dawned on me that I don't want to have to configure yet another Java toolchain 
(yes I know, that may be personal bias).

Thank you to all who responded about the challenges of authority control in 
such collections. I'm aware of these issues. The current project is about 
marshalling resources for editors to make informed decisions about, rather than 
automating the creation of articles. Because there is human judgement involved 
in the last step, I can afford to take a few authority control 'risks'.

cheers
stuart

--
I have a new phone number: 04 463 5692

________________________________________
From: Code for Libraries<CODE4LIB@LISTSERV.ND.EDU>  on behalf of raffaele 
messuti<raffaele.mess...@gmail.com>
Sent: Monday, 3 November 2014 11:39 p.m.
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC reporting engine

Stuart Yeates wrote:
Do any of these have built-in indexing? 800k records isn't going to fit in 
memory and if building my own MARC indexer is 'relatively straightforward' then 
you're a better coder than I am.
you could try marcdb[1] from marctools[2]

[1] https://github.com/ubleipzig/marctools#marcdb
[2] https://github.com/ubleipzig/marctools
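
To illustrate the general idea of indexing on disk instead of in memory, here is a rough sketch using Python's built-in sqlite3 module. It is not how marcdb works internally: the table layout, the function names build_index and top_values, and the (record_id, tag, value) row shape are all assumptions made for this example.

```python
import sqlite3


def build_index(rows, path=":memory:"):
    """Load (record_id, tag, value) rows into SQLite.
    rows may be a generator, so the records never need to sit in memory at once.
    Pass a file path instead of ":memory:" to keep the index on disk."""
    conn = sqlite3.connect(path)
    # Hypothetical schema: one row per field occurrence.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS field (record_id TEXT, tag TEXT, value TEXT)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS field_tag_value ON field (tag, value)"
    )
    with conn:  # commit all inserts as a single transaction
        conn.executemany("INSERT INTO field VALUES (?, ?, ?)", rows)
    return conn


def top_values(conn, tag, limit=10):
    """Return the most frequent values for one MARC tag as (value, count) pairs."""
    cur = conn.execute(
        "SELECT value, COUNT(*) AS n FROM field "
        "WHERE tag = ? GROUP BY value ORDER BY n DESC, value LIMIT ?",
        (tag, limit),
    )
    return cur.fetchall()
```

With an index like this, a report such as "most common 700 entries" becomes a single query, for example top_values(conn, "700").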


--
raffaele
