Re: [CODE4LIB] MARC reporting engine
Hi Stuart,

Author of marctools[1] here – if you have any feature requests that would help you with your processing, please don't hesitate to open an issue on GitHub. Originally, we wrote marctools to convert MARC to JSON and then index[2] the output into Elasticsearch[3] for random access, querying and analysis.

Best,
Martin

[1] https://github.com/ubleipzig/marctools
[2] https://github.com/miku/esbulk
[3] http://www.elasticsearch.org/

On Tue, Nov 4, 2014 at 12:43 AM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
> Hm. You don't need to keep all 800k records in memory, you just need to
> keep the data you need in memory, right? [...]
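A rough Python equivalent of the MARC-to-JSON-to-Elasticsearch workflow Martin describes, sketched with pymarc and the official elasticsearch client rather than the marctools/esbulk binaries; the index name "innz" and the input filename are illustrative assumptions:

    # Convert MARC records to dicts and bulk-index them into Elasticsearch
    # for random access and querying.
    from elasticsearch import Elasticsearch, helpers
    from pymarc import MARCReader

    es = Elasticsearch()  # defaults to localhost:9200

    def actions(path):
        with open(path, "rb") as fh:
            for record in MARCReader(fh):
                yield {"_index": "innz", "_source": record.as_dict()}

    helpers.bulk(es, actions("innzmetadata.mrc"))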
Re: [CODE4LIB] MARC reporting engine
Stuart Yeates wrote:
> Do any of these have built-in indexing? 800k records isn't going to fit in
> memory and if building my own MARC indexer is 'relatively straightforward'
> then you're a better coder than I am.

you could try marcdb[1] from marctools[2]

[1] https://github.com/ubleipzig/marctools#marcdb
[2] https://github.com/ubleipzig/marctools

--
raffaele
Re: [CODE4LIB] MARC reporting engine
I’m surprised you didn’t recommend going straight to Solr and doing the reporting from there :)

Index into Solr using your MARC library of choice (e.g. SolrMarc) and then get all authorities using facet.field=authorities (or whatever field name is used).

	Erik

On Nov 2, 2014, at 7:24 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
> If you are, can become, or know, a programmer, that would be relatively
> straightforward in any programming language using the open source MARC
> processing library for that language. [...]
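A sketch of the facet query Erik describes, once the records are indexed; the core name "innz" and the field name "authorities" are assumptions. Solr returns facet counts as a flat [term, count, term, count, ...] list:

    import requests

    resp = requests.get(
        "http://localhost:8983/solr/innz/select",
        params={
            "q": "*:*",
            "rows": 0,                    # no documents, just facet counts
            "facet": "true",
            "facet.field": "authorities",
            "facet.limit": -1,            # every distinct heading
            "facet.mincount": 1,
            "wt": "json",
        },
    )
    flat = resp.json()["facet_counts"]["facet_fields"]["authorities"]
    for heading, count in zip(flat[::2], flat[1::2]):
        print(count, heading)

One request gives the full referenced-how-often list from output (a) of the original question, already counted by Solr.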
Re: [CODE4LIB] MARC reporting engine
On Sun, Nov 2, 2014 at 6:29 PM, Stuart Yeates <stuart.yea...@vuw.ac.nz> wrote:
> Do any of these have built-in indexing? 800k records isn't going to fit in
> memory [...]

Unless I'm missing something, this task is easier than it sounds. Since you are interested in only a small part of each record, the memory requirements are quite modest, so you can absolutely fit it all into memory while processing the file one record at a time. If I understand your problem correctly, a hash of arrays or objects would make short work of this.

One handy programming reference for people who need syntax for a variety of commonly used tasks (i.e. practically everything you would normally need) in more than 30 languages is PLEAC (Programming Language Examples Alike Cookbook): http://pleac.sourceforge.net/

kyle
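A minimal sketch of the hash-of-arrays approach Kyle describes, using pymarc (mentioned earlier in the thread); picking the 700 field for personal-name added entries and the 001 control number as the record id are illustrative assumptions:

    from collections import defaultdict
    from pymarc import MARCReader

    mentions = defaultdict(list)  # authorized heading -> ids of records citing it

    with open("innzmetadata.mrc", "rb") as fh:
        for record in MARCReader(fh):
            f001 = record["001"]  # older pymarc returns None when absent
            rec_id = f001.value() if f001 else None
            for field in record.get_fields("700"):
                mentions[field.format_field()].append(rec_id)

    # headings sorted by how often they are referenced
    for heading, ids in sorted(mentions.items(), key=lambda kv: len(kv[1]), reverse=True):
        print(len(ids), heading)

Only the headings and id lists live in memory, not the 800k records, which is why the footprint stays modest.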
Re: [CODE4LIB] MARC reporting engine
On Nov 2, 2014, at 9:29 PM, Stuart Yeates <stuart.yea...@vuw.ac.nz> wrote:
> Do any of these have built-in indexing? 800k records isn't going to fit in
> memory [...]

I think the XML DB idea is the way to go, but I’d use BaseX (http://basex.org). BaseX has query and indexing capabilities. If you know XSLT (and SQL) then you’d at least have a start with XQuery.

—Brian
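A sketch of running an XQuery against the MARCXML once loaded into a BaseX database, sent through its REST interface; the database name "innz", the default port and credentials, and the focus on 700 fields are assumptions:

    import requests

    xquery = """
    declare namespace marc = "http://www.loc.gov/MARC21/slim";
    for $f in //marc:datafield[@tag = "700"]
    let $heading := string-join($f/marc:subfield/text(), " ")
    group by $heading
    order by count($f) descending
    return $heading || " | " || count($f)
    """

    resp = requests.get(
        "http://localhost:8984/rest/innz",
        params={"query": xquery},
        auth=("admin", "admin"),  # BaseX default credentials
    )
    print(resp.text)

BaseX evaluates the grouping and counting inside the database, so nothing needs to fit in the client's memory.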
Re: [CODE4LIB] MARC reporting engine
Thank you to all who responded with software suggestions. https://github.com/ubleipzig/marctools is looking like the most promising candidate so far. The more I read through the recommendations, the more it dawned on me that I don't want to have to configure yet another Java toolchain (yes, I know, that may be personal bias).

Thank you to all who responded about the challenges of authority control in such collections. I'm aware of these issues. The current project is about marshalling resources for editors to make informed decisions, rather than automating the creation of articles; because there is human judgement involved in the last step, I can afford to take a few authority control 'risks'.

cheers
stuart

--
I have a new phone number: 04 463 5692

From: Code for Libraries <CODE4LIB@LISTSERV.ND.EDU> on behalf of raffaele messuti <raffaele.mess...@gmail.com>
Sent: Monday, 3 November 2014 11:39 p.m.
Subject: Re: [CODE4LIB] MARC reporting engine

> you could try marcdb[1] from marctools[2] [...]
Re: [CODE4LIB] MARC reporting engine
Apologies, I should have used plain English for an international audience. 'Sundry' means 'miscellaneous' or 'other'.

Ideally, for each person I'd generate a range of dates for mentions and a check to see whether they have obituaries in the index. I'll also generate URLs into the search engines of various external systems (WorldCat, VIAF, ORCID, DigitalNZ, etc.) because these are useful to the editor who makes the decisions about using the content to make the Wikipedia stub.

cheers
stuart

--
I have a new phone number: 04 463 5692

From: Code for Libraries <CODE4LIB@LISTSERV.ND.EDU> on behalf of Jean-Claude Dauphin <jc.daup...@gmail.com>
Sent: Tuesday, 4 November 2014 7:40 a.m.
Subject: Re: [CODE4LIB] MARC reporting engine

> Hi Stuart,
>
> I made some experiments with the innz-metadata in the J-ISIS software, and
> you may be interested to read the summary which is attached. Thank you for
> informing the CODE4LIB list about the innz-metadata dataset; it is very
> useful for testing and improving J-ISIS. Now I would like to see if it's
> easy to do what you wish to achieve with J-ISIS.
>
> Please excuse my ignorance, but could you please explain which MARC fields
> or subfields you wish to extract the person authorities from, what the
> sundry metadata are, and how they relate to the MARC records? I googled
> "sundry metadata" but didn't find any satisfactory information.
>
> Best wishes,
> Jean-Claude [...]
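A sketch of generating the editor-facing search links Stuart mentions, for one heading; the URL templates are assumptions based on each site's public search pages rather than stable APIs, and the output uses wikimedia external-link syntax:

    from urllib.parse import quote_plus

    TEMPLATES = {
        "WorldCat": "https://www.worldcat.org/search?q={q}",
        "VIAF": "https://viaf.org/viaf/search?query={q}",
        "DigitalNZ": "https://digitalnz.org/records?text={q}",
    }

    def external_links(heading):
        q = quote_plus(heading)
        return {name: tpl.format(q=q) for name, tpl in TEMPLATES.items()}

    for name, url in external_links("Mansfield, Katherine, 1888-1923").items():
        print("* [%s %s]" % (url, name))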
Re: [CODE4LIB] MARC reporting engine
I was going to echo Erik Hatcher's recommendation of Solr and SolrMarc, since I'm the creator of SolrMarc. It does provide many of the same tools as are described in the toolset you linked to, but it is designed to write to Solr rather than to a SQL-style database. Solr may or may not be more suitable for your needs than a SQL database.

However, I decided to download the data to see whether SolrMarc could handle it. I started with the MARCXML .gz data and ungzipped it to get a .xml file, but the resulting file causes SolrMarc to blow chunks. Either I'm missing something or there is something way wrong with that data. The gzipped binary MARC file works fine with the SolrMarc tools.

Creating a SolrMarc script to extract the 700 fields, plus a bash script to cluster and count them and sort by frequency, took about 20 minutes.

-Bob Haschart

On 11/3/2014 3:00 PM, Stuart Yeates wrote:
> Thank you to all who responded with software suggestions. [...]
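A rough Python equivalent of the cluster/count/sort-by-frequency step Bob describes, assuming the extracted 700 fields sit one heading per line in a file called headings.txt (both the filename and the one-per-line layout are assumptions):

    from collections import Counter

    with open("headings.txt", encoding="utf-8") as fh:
        counts = Counter(line.strip() for line in fh if line.strip())

    # most frequently referenced headings first
    for heading, n in counts.most_common():
        print(n, heading)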
Re: [CODE4LIB] MARC reporting engine
The MARC XML seemed to be an archive within an archive – I had to gunzip to get innzmetadata.xml, then rename to innzmetadata.xml.gz and gunzip again to get the actual XML.

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 3 Nov 2014, at 22:38, Robert Haschart <rh...@virginia.edu> wrote:
> I was going to echo Erik Hatcher's recommendation of Solr and SolrMarc,
> since I'm the creator of SolrMarc. [...]
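The same unwrapping of the doubly-gzipped file in Python, for anyone scripting the download; the intermediate filename is an assumption:

    import gzip
    import shutil

    # First layer: the download is a gzip of a gzip.
    with gzip.open("innzmetadata.xml.gz", "rb") as outer, open("inner.xml.gz", "wb") as tmp:
        shutil.copyfileobj(outer, tmp)

    # Second layer: this yields the actual MARCXML.
    with gzip.open("inner.xml.gz", "rb") as inner, open("innzmetadata.xml", "wb") as xml:
        shutil.copyfileobj(inner, xml)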
Re: [CODE4LIB] MARC reporting engine
Hm. You don't need to keep all 800k records in memory, you just need to keep the data you need in memory, right? I'd keep a hash keyed by authorized heading, with the values I need there.

I don't think you'll have trouble keeping such a hash in memory for a batch process run manually once in a while -- modern OSes do a great job with virtual memory, making it invisible (but slower) when you use more memory than you have physically, if it comes to that, which it may not.

If you do, you could keep the data you need in the data store of your choice, such as a local DBM database, which ruby/python/perl will all let you do pretty painlessly: you get a hash-like data structure which is actually stored on disk, not in memory, but which you access more or less the same as an in-memory hash.

But, yes, it will require some programming, for sure.

A "MARC indexer" can mean many things, and I'm not sure you need one here, but as it happens I have built something you could describe as a MARC indexer, and I guess it wasn't exactly straightforward, it's true. I'm not sure it's of any use to you for your use case, but you can check it out at https://github.com/traject-project/traject

On 11/2/14 9:29 PM, Stuart Yeates wrote:
> Do any of these have built-in indexing? 800k records isn't going to fit in
> memory and if building my own MARC indexer is 'relatively straightforward'
> then you're a better coder than I am. [...]
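A minimal sketch of the disk-backed hash Jonathan describes, using Python's shelve module (a dict-like store on top of DBM); the key and value layout are illustrative assumptions:

    import shelve

    with shelve.open("authorities.db") as store:
        heading = "Mansfield, Katherine, 1888-1923"  # authorized heading from a record
        entry = store.get(heading, {"count": 0, "record_ids": []})
        entry["count"] += 1
        entry["record_ids"].append("rec-00001")
        store[heading] = entry  # used like an in-memory hash, stored on disk

If the in-memory version ever outgrows RAM, swapping the dict for a shelf is close to a one-line change, which is the painlessness Jonathan is pointing at.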
Re: [CODE4LIB] MARC reporting engine
If you are, can become, or know, a programmer, that would be relatively straightforward in any programming language, using the open source MARC processing library for that language (ruby-marc, pymarc, perl MARC, whatever). Although you might find more trouble than you expect around authorities, with them being less standardized in your corpus than you might like.

From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Stuart Yeates [stuart.yea...@vuw.ac.nz]
Sent: Sunday, November 02, 2014 5:48 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] MARC reporting engine

I have ~800,000 MARC records from an indexing service (http://natlib.govt.nz/about-us/open-data/innz-metadata CC-BY). I am trying to generate:

(a) a list of person authorities (and sundry metadata), sorted by how many times they're referenced, in wikimedia syntax

(b) a view of a person authority, with all the records by which they're referenced, processed into a wikipedia stub biography

I have established that this is too much data to process in XSLT or multi-line regexps in vi. What other MARC engines are out there? The two options I'm aware of are learning multi-line processing in sed or learning enough Koha to write reports in whatever their reporting engine is.

Any advice?

cheers
stuart

--
I have a new phone number: 04 463 5692
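Output (a) of the original question asks for the list "in wikimedia syntax"; a sketch rendering already-counted headings as a sortable wikitext table (the column names are assumptions):

    def to_wikitable(counts):
        """counts: iterable of (heading, n), already sorted by n descending."""
        lines = ['{| class="wikitable sortable"', "|-", "! Heading !! References"]
        for heading, n in counts:
            lines.append("|-")
            lines.append("| %s || %d" % (heading, n))
        lines.append("|}")
        return "\n".join(lines)

    print(to_wikitable([("Mansfield, Katherine, 1888-1923", 42)]))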
Re: [CODE4LIB] MARC reporting engine
It looks like the dataset is available in XML format. Perhaps you can import it into an XML database (eXist - exist-db.org comes to mind), and then generate a report via its query capabilities.

Miles Fidelman

Jonathan Rochkind wrote:
> If you are, can become, or know, a programmer, that would be relatively
> straightforward in any programming language using the open source MARC
> processing library for that language. [...]

--
In theory, there is no difference between theory and practice. In practice, there is. .... Yogi Berra
Re: [CODE4LIB] MARC reporting engine
Do any of these have built-in indexing? 800k records isn't going to fit in memory and if building my own MARC indexer is 'relatively straightforward' then you're a better coder than I am.

cheers
stuart

--
I have a new phone number: 04 463 5692

From: Code for Libraries <CODE4LIB@LISTSERV.ND.EDU> on behalf of Jonathan Rochkind <rochk...@jhu.edu>
Sent: Monday, 3 November 2014 1:24 p.m.
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC reporting engine

> If you are, can become, or know, a programmer, that would be relatively
> straightforward in any programming language using the open source MARC
> processing library for that language. [...]