The MARC XML seemed to be an archive within an archive: I had to gunzip it to get innzmetadata.xml, rename that to innzmetadata.xml.gz, and gunzip again to get the actual XML.
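In case it saves anyone the rename dance, here is a minimal Python sketch of the same double gunzip. It keeps decompressing for as long as the data still starts with the gzip magic bytes. The function names are my own, and it reads the whole file into memory, so treat it as an illustration rather than something tuned for a dump of this size.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def peel_gzip(data):
    """Undo nested gzip layers; return (plain_bytes, layers_removed)."""
    layers = 0
    # Keep decompressing while the payload still looks like gzip.
    while data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
        layers += 1
    return data, layers


def peel_gzip_file(src_path, dst_path):
    """Read src_path, strip every gzip layer, write the result to dst_path."""
    # Assumption: the file fits in memory; a large dump would need streaming.
    with open(src_path, "rb") as src:
        plain, layers = peel_gzip(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(plain)
    return layers
```

Calling peel_gzip_file on the downloaded .gz with innzmetadata.xml as the destination would return 2 if the file really is wrapped twice, and 1 for an ordinary gzip file.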
Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

> On 3 Nov 2014, at 22:38, Robert Haschart <rh...@virginia.edu> wrote:
>
> I was going to echo Eric Hatcher's recommendation of Solr and SolrMarc, since
> I'm the creator of SolrMarc.
> It does provide many of the same tools as are described in the toolset you
> linked to, but it is designed to write to Solr rather than to a SQL style
> database. Solr may or may not be more suitable for your needs than a SQL
> database. However I decided to download the data to see whether SolrMarc
> could handle it. I started with the MARCXML.gz data, ungzipped it to get a
> .XML file, but the resulting file causes SolrMarc to blow chunks. Either
> I'm missing something or there is something way wrong with that data. The
> gzipped binary MARC file works fine with the SolrMarc tools.
>
> Creating a SolrMarc script to extract the 700 fields, plus a bash script to
> cluster and count them, and sort by frequency took about 20 minutes.
>
> -Bob Haschart
>
>
> On 11/3/2014 3:00 PM, Stuart Yeates wrote:
>> Thank you to all who responded with software suggestions.
>> https://github.com/ubleipzig/marctools is looking like the most promising
>> candidate so far. The more I read through the recommendations the more it
>> dawned on me that I don't want to have to configure yet another java
>> toolchain (yes I know, that may be personal bias).
>>
>> Thank you to all who responded about the challenges of authority control in
>> such collections. I'm aware of these issues. The current project is about
>> marshalling resources for editors to make informed decisions about rather
>> than automating the creation of articles, because there is human judgement
>> involved in the last step I can afford to take a few authority control
>> 'risks'
>>
>> cheers
>> stuart
>>
>> --
>> I have a new phone number: 04 463 5692
>>
>> ________________________________________
>> From: Code for Libraries<CODE4LIB@LISTSERV.ND.EDU> on behalf of raffaele
>> messuti<raffaele.mess...@gmail.com>
>> Sent: Monday, 3 November 2014 11:39 p.m.
>> To: CODE4LIB@LISTSERV.ND.EDU
>> Subject: Re: [CODE4LIB] MARC reporting engine
>>
>> Stuart Yeates wrote:
>>> Do any of these have built-in indexing? 800k records isn't going to fit in
>>> memory and if building my own MARC indexer is 'relatively straightforward'
>>> then you're a better coder than I am.
>> you could try marcdb[1] from marctools[2]
>>
>> [1] https://github.com/ubleipzig/marctools#marcdb
>> [2] https://github.com/ubleipzig/marctools
>>
>>
>> --
>> raffaele