Re: [CODE4LIB] MARC reporting engine

2014-11-04 Thread Martin Czygan
Hi Stuart,

Author of marctools[1] here – if you have any feature request that
would help you with your processing, please don't hesitate to open an
issue on github.

Originally, we wrote marctools to convert MARC to JSON and then
index[2] the output into elasticsearch[3] for random access, query and analysis.

Best,
Martin




[1] https://github.com/ubleipzig/marctools
[2] https://github.com/miku/esbulk
[3] http://www.elasticsearch.org/

On Tue, Nov 4, 2014 at 12:43 AM, Jonathan Rochkind rochk...@jhu.edu wrote:
 Hm. You don't need to keep all 800k records in memory, you just need to keep
 the data you need in memory, right? I'd keep a hash keyed by authorized
 heading, with the values I need there.

 I don't think you'll have trouble keeping such a hash in memory, for a batch
 process run manually once in a while -- modern OS's do a great job with
 virtual memory making it invisible (but slower) when you use more memory
 than you have physically, if it comes to that, which it may not.

 If you do, you could keep the data you need in the data store of your
 choice, such as a local DBM database, which ruby/python/perl will all let
 you do pretty painlessly, accessing a hash-like data structure which is
 actually stored on disk not in memory but which you access more or less the
 same as an in-memory hash.

 But, yes, it will require some programming, for sure.

 A 'MARC indexer' can mean many things, and I'm not sure you need one here,
 but as it happens I have built something you could describe as a MARC
 indexer, and I guess it wasn't exactly 'straightforward', it's true. I'm not
 sure it's of any use to you here for your use case, but you can check it out
 at https://github.com/traject-project/traject


 On 11/2/14 9:29 PM, Stuart Yeates wrote:

 Do any of these have built-in indexing? 800k records isn't going to
 fit in memory and if building my own MARC indexer is 'relatively
 straightforward' then you're a better coder than I am.

 cheers stuart

 -- I have a new phone number: 04 463 5692

 From: Code for Libraries CODE4LIB@LISTSERV.ND.EDU on behalf of Jonathan Rochkind rochk...@jhu.edu
 Sent: Monday, 3 November 2014 1:24 p.m.
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] MARC reporting engine

 If you are, can become, or know, a programmer, that would be
 relatively straightforward in any programming language using the open
 source MARC processing library for that language. (ruby marc, pymarc,
 perl marc, whatever).

 Although you might find more trouble than you expect around
 authorities, with them being less standardized in your corpus than
 you might like.

 From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Stuart Yeates [stuart.yea...@vuw.ac.nz]
 Sent: Sunday, November 02, 2014 5:48 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] MARC reporting engine

 I have ~800,000 MARC records from an indexing service
 (http://natlib.govt.nz/about-us/open-data/innz-metadata CC-BY). I am
 trying to generate:

 (a) a list of person authorities (and sundry metadata), sorted by how
 many times they're referenced, in wikimedia syntax

 (b) a view of a person authority, with all the records by which
 they're referenced, processed into a wikipedia stub biography

 I have established that this is too much data to process in XSLT or
 multi-line regexps in vi. What other MARC engines are there out
 there?

 The two options I'm aware of are learning multi-line processing in
 sed or learning enough koha to write reports in whatever their
 reporting engine is.

 Any advice?

 cheers stuart -- I have a new phone number: 04 463 5692





Re: [CODE4LIB] MARC reporting engine

2014-11-03 Thread raffaele messuti
Stuart Yeates wrote:
 Do any of these have built-in indexing? 800k records isn't going to fit in 
 memory and if building my own MARC indexer is 'relatively straightforward' 
 then you're a better coder than I am. 

you could try marcdb[1] from marctools[2]

[1] https://github.com/ubleipzig/marctools#marcdb
[2] https://github.com/ubleipzig/marctools


--
raffaele


Re: [CODE4LIB] MARC reporting engine

2014-11-03 Thread Erik Hatcher
I’m surprised you didn’t recommend going straight to Solr and doing the 
reporting from there :)   Index into Solr using your MARC library of choice 
(e.g. solrmarc) and then get all authorities using facet.field=authorities (or 
whatever field name is used).
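A sketch of what that facet request might look like, built with Python's standard library. The core URL, the field name `authorities`, and the parameter choices are assumptions to be adapted to whatever the actual index uses:

```python
from urllib.parse import urlencode

# Hypothetical Solr core URL -- substitute your own.
SOLR_URL = "http://localhost:8983/solr/marc/select"

params = {
    "q": "*:*",
    "rows": 0,                     # no documents, only facet counts
    "facet": "true",
    "facet.field": "authorities",  # assumed field name
    "facet.limit": -1,             # return every distinct heading
    "facet.sort": "count",         # most-referenced headings first
    "wt": "json",
}

query_url = SOLR_URL + "?" + urlencode(params)
print(query_url)
```

Fetching that URL (e.g. with urllib.request) would return every heading with its reference count, which is essentially task (a) already done.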

Erik



On Nov 2, 2014, at 7:24 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 If you are, can become, or know, a programmer, that would be relatively 
 straightforward in any programming language using the open source MARC 
 processing library for that language. (ruby marc, pymarc, perl marc, 
 whatever).  
 
 Although you might find more trouble than you expect around authorities, with 
 them being less standardized in your corpus than you might like. 
 
 From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Stuart 
 Yeates [stuart.yea...@vuw.ac.nz]
 Sent: Sunday, November 02, 2014 5:48 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] MARC reporting engine
 
 I have ~800,000 MARC records from an indexing service 
 (http://natlib.govt.nz/about-us/open-data/innz-metadata CC-BY). I am trying 
 to generate:
 
 (a) a list of person authorities (and sundry metadata), sorted by how many 
 times they're referenced, in wikimedia syntax
 
 (b) a view of a person authority, with all the records by which they're 
 referenced, processed into a wikipedia stub biography
 
 I have established that this is too much data to process in XSLT or 
 multi-line regexps in vi. What other MARC engines are there out there?
 
 The two options I'm aware of are learning multi-line processing in sed or 
 learning enough koha to write reports in whatever their reporting engine is.
 
 Any advice?
 
 cheers
 stuart
 --
 I have a new phone number: 04 463 5692


Re: [CODE4LIB] MARC reporting engine

2014-11-03 Thread Kyle Banerjee
On Sun, Nov 2, 2014 at 6:29 PM, Stuart Yeates stuart.yea...@vuw.ac.nz
wrote:

 Do any of these have built-in indexing? 800k records isn't going to fit in
 memory and if building my own MARC indexer is 'relatively straightforward'
 then you're a better coder than I am.


Unless I'm missing something, this task is easier than it sounds. Since you
are interested in only a small part of the record, the memory requirements
are quite modest so you can absolutely fit it all into memory while
processing the file one line at a time. If I understand your problem
correctly, a hash of arrays or objects would make short work of this.
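A minimal sketch of that hash-of-arrays idea. Plain dicts stand in for parsed records here; in practice pymarc's MARCReader would stream them from the file, and treating 700$a as the heading source is an assumption:

```python
from collections import Counter, defaultdict

# Stand-in records: in practice these would come one at a time from
# pymarc's MARCReader iterating over the 800k-record file.
records = [
    {"001": "rec1", "700a": ["Smith, John"]},
    {"001": "rec2", "700a": ["Smith, John", "Jones, Mary"]},
    {"001": "rec3", "700a": ["Jones, Mary"]},
]

counts = Counter()             # heading -> number of references
mentions = defaultdict(list)   # heading -> ids of records citing it

for rec in records:
    for heading in rec.get("700a", []):  # 700$a is an assumed source field
        counts[heading] += 1
        mentions[heading].append(rec["001"])

# Headings sorted by how often they are referenced, as in task (a)
for heading, n in counts.most_common():
    print(f"* {heading} ({n})")
```

Only the headings and record ids are held in memory, not the records themselves, which is why 800k records is no problem.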

One handy programming reference for people who need syntax for a variety of
commonly used tasks (i.e. practically everything you would normally need)
in more than 30 languages is PLEAC (Programming Language Examples Alike
Cookbook) http://pleac.sourceforge.net/

kyle


Re: [CODE4LIB] MARC reporting engine

2014-11-03 Thread Brian Kennison
On Nov 2, 2014, at 9:29 PM, Stuart Yeates 
stuart.yea...@vuw.ac.nzmailto:stuart.yea...@vuw.ac.nz wrote:

Do any of these have built-in indexing? 800k records isn't going to fit in 
memory and if building my own MARC indexer is 'relatively straightforward' then 
you're a better coder than I am.



I think the XMLDB idea is the way to go but I'd use BaseX (http://basex.org). 
BaseX has query and indexing capabilities; if you know XSLT (and SQL) then 
you'd at least have a start with XQuery.

—Brian


Re: [CODE4LIB] MARC reporting engine

2014-11-03 Thread Stuart Yeates
Thank you to all who responded with software suggestions. 
https://github.com/ubleipzig/marctools is looking like the most promising 
candidate so far. The more I read through the recommendations, the more it 
dawned on me that I don't want to have to configure yet another Java toolchain 
(yes, I know, that may be personal bias).

Thank you to all who responded about the challenges of authority control in 
such collections. I'm aware of these issues. The current project is about 
marshalling resources for editors to make informed decisions, rather than 
automating the creation of articles; because there is human judgement involved 
in the last step, I can afford to take a few authority control 'risks'.

cheers
stuart

--
I have a new phone number: 04 463 5692


From: Code for Libraries CODE4LIB@LISTSERV.ND.EDU on behalf of raffaele 
messuti raffaele.mess...@gmail.com
Sent: Monday, 3 November 2014 11:39 p.m.
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC reporting engine

Stuart Yeates wrote:
 Do any of these have built-in indexing? 800k records isn't going to fit in 
 memory and if building my own MARC indexer is 'relatively straightforward' 
 then you're a better coder than I am.

you could try marcdb[1] from marctools[2]

[1] https://github.com/ubleipzig/marctools#marcdb
[2] https://github.com/ubleipzig/marctools


--
raffaele


Re: [CODE4LIB] MARC reporting engine

2014-11-03 Thread Stuart Yeates
Apologies, I should have used plain English for an international audience. 
'Sundry' means 'miscellaneous' or 'other'.

Ideally, for each person I'd generate a range of dates for their mentions and a 
check to see whether they had obituaries in the index. I'll also generate URLs 
into the search engines of various external systems (WorldCat, VIAF, ORCID, 
DigitalNZ, etc.) because these are useful to the editor who makes the decisions 
about using the content to make the wikipedia stub.
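For illustration, generating such editor-facing links in wikimedia external-link syntax might look like the sketch below. The URL templates are hypothetical placeholders, not the services' documented search endpoints:

```python
from urllib.parse import quote_plus

# Hypothetical search-URL templates -- placeholders only; the real
# endpoint for each external system would need to be checked.
TEMPLATES = {
    "VIAF": "https://viaf.org/search?q={q}",
    "WorldCat": "https://www.worldcat.org/search?q={q}",
}

def editor_links(name):
    """Return wikimedia-syntax external links for one person heading."""
    q = quote_plus(name)
    return [f"* [{TEMPLATES[svc].format(q=q)} {svc}]" for svc in TEMPLATES]

for line in editor_links("Smith, John"):
    print(line)
```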

cheers
stuart

--
I have a new phone number: 04 463 5692


From: Code for Libraries CODE4LIB@LISTSERV.ND.EDU on behalf of Jean-Claude 
Dauphin jc.daup...@gmail.com
Sent: Tuesday, 4 November 2014 7:40 a.m.
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC reporting engine

Hi Stuart,

I made some experiments with the innz-metadata in the J-ISIS software, and you
may be interested to read the summary, which is attached. Thank you for
informing the CODE4LIB list about the innz-metadata dataset; it is very
useful for testing and improving J-ISIS.

But now, I would like to see whether it's easy to do what you wish to achieve
with J-ISIS. Please excuse my ignorance, but could you please explain which
MARC fields or subfields you wish to extract the person authorities from, and
explain to me what the 'sundry metadata' are and how they are related to the
MARC records? I googled 'sundry metadata' but didn't find any satisfactory
information.

Best wishes,

Jean-Claude

On Mon, Nov 3, 2014 at 4:24 PM, Brian Kennison kennis...@wcsu.edu wrote:

 On Nov 2, 2014, at 9:29 PM, Stuart Yeates stuart.yea...@vuw.ac.nzmailto:
 stuart.yea...@vuw.ac.nz wrote:

 Do any of these have built-in indexing? 800k records isn't going to fit in
 memory and if building my own MARC indexer is 'relatively straightforward'
 then you're a better coder than I am.



 I think the XMLDB idea is the way to go but I'd use BaseX (
 http://basex.org). BaseX has query and indexing capabilities; if you
 know XSLT (and SQL) then you'd at least have a start with XQuery.

 —Brian




--
Jean-Claude Dauphin

jc.daup...@gmail.com
jc.daup...@afus.unesco.org

http://kenai.com/projects/j-isis/
http://www.unesco.org/isis/
http://www.unesco.org/idams/
http://www.greenstone.org


Re: [CODE4LIB] MARC reporting engine

2014-11-03 Thread Robert Haschart
I was going to echo Erik Hatcher's recommendation of Solr and SolrMarc, 
since I'm the creator of SolrMarc.
It does provide many of the same tools as are described in the toolset 
you linked to, but it is designed to write to Solr rather than to a 
SQL-style database. Solr may or may not be more suitable for your needs 
than a SQL database. However, I decided to download the data to see 
whether SolrMarc could handle it. I started with the MARCXML.gz data and 
ungzipped it to get a .xml file, but the resulting file causes SolrMarc 
to blow chunks. Either I'm missing something or there is something very 
wrong with that data. The gzipped binary MARC file works fine with the 
SolrMarc tools.


Creating a SolrMarc script to extract the 700 fields, plus a bash script 
to cluster and count them and sort by frequency, took about 20 minutes.


-Bob Haschart


On 11/3/2014 3:00 PM, Stuart Yeates wrote:

Thank you to all who responded with software suggestions. 
https://github.com/ubleipzig/marctools is looking like the most promising 
candidate so far. The more I read through the recommendations, the more it 
dawned on me that I don't want to have to configure yet another Java toolchain 
(yes, I know, that may be personal bias).

Thank you to all who responded about the challenges of authority control in 
such collections. I'm aware of these issues. The current project is about 
marshalling resources for editors to make informed decisions, rather than 
automating the creation of articles; because there is human judgement involved 
in the last step, I can afford to take a few authority control 'risks'.

cheers
stuart

--
I have a new phone number: 04 463 5692


From: Code for Libraries CODE4LIB@LISTSERV.ND.EDU on behalf of raffaele 
messuti raffaele.mess...@gmail.com
Sent: Monday, 3 November 2014 11:39 p.m.
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC reporting engine

Stuart Yeates wrote:

Do any of these have built-in indexing? 800k records isn't going to fit in 
memory and if building my own MARC indexer is 'relatively straightforward' then 
you're a better coder than I am.

you could try marcdb[1] from marctools[2]

[1] https://github.com/ubleipzig/marctools#marcdb
[2] https://github.com/ubleipzig/marctools


--
raffaele


Re: [CODE4LIB] MARC reporting engine

2014-11-03 Thread Owen Stephens
The MARC XML seemed to be an archive within an archive: I had to gunzip to get 
innzmetadata.xml, then rename it to innzmetadata.xml.gz and gunzip again to get 
the actual XML.

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

 On 3 Nov 2014, at 22:38, Robert Haschart rh...@virginia.edu wrote:
 
 I was going to echo Erik Hatcher's recommendation of Solr and SolrMarc, since 
 I'm the creator of SolrMarc.
 It does provide many of the same tools as are described in the toolset you 
 linked to, but it is designed to write to Solr rather than to a SQL-style 
 database. Solr may or may not be more suitable for your needs than a SQL 
 database. However, I decided to download the data to see whether SolrMarc 
 could handle it. I started with the MARCXML.gz data and ungzipped it to get a 
 .xml file, but the resulting file causes SolrMarc to blow chunks. Either 
 I'm missing something or there is something very wrong with that data. The 
 gzipped binary MARC file works fine with the SolrMarc tools.
 
 Creating a SolrMarc script to extract the 700 fields, plus a bash script to 
 cluster and count them and sort by frequency, took about 20 minutes.
 
 -Bob Haschart
 
 
 On 11/3/2014 3:00 PM, Stuart Yeates wrote:
 Thank you to all who responded with software suggestions. 
 https://github.com/ubleipzig/marctools is looking like the most promising 
 candidate so far. The more I read through the recommendations, the more it 
 dawned on me that I don't want to have to configure yet another Java 
 toolchain (yes, I know, that may be personal bias).
 
 Thank you to all who responded about the challenges of authority control in 
 such collections. I'm aware of these issues. The current project is about 
 marshalling resources for editors to make informed decisions, rather than 
 automating the creation of articles; because there is human judgement 
 involved in the last step, I can afford to take a few authority control 
 'risks'.
 
 cheers
 stuart
 
 --
 I have a new phone number: 04 463 5692
 
 
 From: Code for Libraries CODE4LIB@LISTSERV.ND.EDU on behalf of raffaele 
 messuti raffaele.mess...@gmail.com
 Sent: Monday, 3 November 2014 11:39 p.m.
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] MARC reporting engine
 
 Stuart Yeates wrote:
 Do any of these have built-in indexing? 800k records isn't going to fit in 
 memory and if building my own MARC indexer is 'relatively straightforward' 
 then you're a better coder than I am.
 you could try marcdb[1] from marctools[2]
 
 [1] https://github.com/ubleipzig/marctools#marcdb
 [2] https://github.com/ubleipzig/marctools
 
 
 --
 raffaele


Re: [CODE4LIB] MARC reporting engine

2014-11-03 Thread Jonathan Rochkind
Hm. You don't need to keep all 800k records in memory, you just need to 
keep the data you need in memory, right? I'd keep a hash keyed by 
authorized heading, with the values I need there.


I don't think you'll have trouble keeping such a hash in memory, for a 
batch process run manually once in a while -- modern OS's do a great job 
with virtual memory making it invisible (but slower) when you use more 
memory than you have physically, if it comes to that, which it may not.


If you do, you could keep the data you need in the data store of your 
choice, such as a local DBM database, which ruby/python/perl will all 
let you do pretty painlessly, accessing a hash-like data structure which 
is actually stored on disk not in memory but which you access more or 
less the same as an in-memory hash.
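A minimal sketch of that disk-backed hash in Python, using the standard library's shelve module (which wraps a DBM file). The headings here are stand-ins for values that would be extracted while streaming the MARC file:

```python
import os
import shelve
import tempfile

# Stand-in headings; in practice these would be pulled from each
# record's authority fields while reading the MARC file one record
# at a time.
headings = ["Smith, John", "Jones, Mary", "Smith, John"]

path = os.path.join(tempfile.mkdtemp(), "authorities")

# shelve behaves like a dict but persists to a DBM file on disk,
# so the tallies never need to fit in memory.
with shelve.open(path) as db:
    for h in headings:
        db[h] = db.get(h, 0) + 1

# Reopen to show the counts really live on disk, not in memory.
with shelve.open(path) as db:
    result = dict(db)

print(result)
```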


But, yes, it will require some programming, for sure.

A 'MARC indexer' can mean many things, and I'm not sure you need one
here, but as it happens I have built something you could describe as a
MARC indexer, and I guess it wasn't exactly 'straightforward', it's
true. I'm not sure it's of any use to you here for your use case, but
you can check it out at https://github.com/traject-project/traject


On 11/2/14 9:29 PM, Stuart Yeates wrote:

Do any of these have built-in indexing? 800k records isn't going to
fit in memory and if building my own MARC indexer is 'relatively
straightforward' then you're a better coder than I am.

cheers stuart

-- I have a new phone number: 04 463 5692

From: Code for Libraries CODE4LIB@LISTSERV.ND.EDU on behalf of Jonathan Rochkind rochk...@jhu.edu
Sent: Monday, 3 November 2014 1:24 p.m.
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC reporting engine

If you are, can become, or know, a programmer, that would be
relatively straightforward in any programming language using the open
source MARC processing library for that language. (ruby marc, pymarc,
perl marc, whatever).

Although you might find more trouble than you expect around
authorities, with them being less standardized in your corpus than
you might like.

From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Stuart Yeates [stuart.yea...@vuw.ac.nz]
Sent: Sunday, November 02, 2014 5:48 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] MARC reporting engine

I have ~800,000 MARC records from an indexing service
(http://natlib.govt.nz/about-us/open-data/innz-metadata CC-BY). I am
trying to generate:

(a) a list of person authorities (and sundry metadata), sorted by how
many times they're referenced, in wikimedia syntax

(b) a view of a person authority, with all the records by which
they're referenced, processed into a wikipedia stub biography

I have established that this is too much data to process in XSLT or
multi-line regexps in vi. What other MARC engines are there out
there?

The two options I'm aware of are learning multi-line processing in
sed or learning enough koha to write reports in whatever their
reporting engine is.

Any advice?

cheers stuart -- I have a new phone number: 04 463 5692




Re: [CODE4LIB] MARC reporting engine

2014-11-02 Thread Jonathan Rochkind
If you are, can become, or know, a programmer, that would be relatively 
straightforward in any programming language using the open source MARC 
processing library for that language. (ruby marc, pymarc, perl marc, whatever). 
 

Although you might find more trouble than you expect around authorities, with 
them being less standardized in your corpus than you might like. 

From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Stuart Yeates 
[stuart.yea...@vuw.ac.nz]
Sent: Sunday, November 02, 2014 5:48 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] MARC reporting engine

I have ~800,000 MARC records from an indexing service 
(http://natlib.govt.nz/about-us/open-data/innz-metadata CC-BY). I am trying to 
generate:

(a) a list of person authorities (and sundry metadata), sorted by how many 
times they're referenced, in wikimedia syntax

(b) a view of a person authority, with all the records by which they're 
referenced, processed into a wikipedia stub biography

I have established that this is too much data to process in XSLT or multi-line 
regexps in vi. What other MARC engines are there out there?

The two options I'm aware of are learning multi-line processing in sed or 
learning enough koha to write reports in whatever their reporting engine is.

Any advice?

cheers
stuart
--
I have a new phone number: 04 463 5692


Re: [CODE4LIB] MARC reporting engine

2014-11-02 Thread Miles Fidelman
It looks like the dataset is available in XML format. Perhaps you can 
import it into an XML database (eXist, exist-db.org, comes to mind) and 
then generate a report via its query capabilities.
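Whichever database the XML ends up in, the dump can also be streamed directly; a rough sketch with Python's iterparse follows. Namespace handling is omitted (the real file uses the MARC21 slim namespace), and treating 700$a as the heading field is an assumption:

```python
import xml.etree.ElementTree as ET
from collections import Counter
from io import BytesIO

# A tiny inline stand-in for the real MARCXML dump.
MARCXML = b"""<collection>
  <record>
    <datafield tag="700"><subfield code="a">Smith, John</subfield></datafield>
  </record>
  <record>
    <datafield tag="700"><subfield code="a">Smith, John</subfield></datafield>
    <datafield tag="700"><subfield code="a">Jones, Mary</subfield></datafield>
  </record>
</collection>"""

counts = Counter()
# iterparse streams the document, so even 800k records never sit in
# memory at once.
for event, elem in ET.iterparse(BytesIO(MARCXML), events=("end",)):
    if elem.tag == "datafield" and elem.get("tag") == "700":
        for sub in elem.findall("subfield"):
            if sub.get("code") == "a":
                counts[sub.text] += 1
        elem.clear()  # release the element once it has been counted

print(counts.most_common())
```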


Miles Fidelman

Jonathan Rochkind wrote:

If you are, can become, or know, a programmer, that would be relatively 
straightforward in any programming language using the open source MARC 
processing library for that language. (ruby marc, pymarc, perl marc, whatever).

Although you might find more trouble than you expect around authorities, with 
them being less standardized in your corpus than you might like.

From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Stuart Yeates 
[stuart.yea...@vuw.ac.nz]
Sent: Sunday, November 02, 2014 5:48 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] MARC reporting engine

I have ~800,000 MARC records from an indexing service 
(http://natlib.govt.nz/about-us/open-data/innz-metadata CC-BY). I am trying to 
generate:

(a) a list of person authorities (and sundry metadata), sorted by how many 
times they're referenced, in wikimedia syntax

(b) a view of a person authority, with all the records by which they're 
referenced, processed into a wikipedia stub biography

I have established that this is too much data to process in XSLT or multi-line 
regexps in vi. What other MARC engines are there out there?

The two options I'm aware of are learning multi-line processing in sed or 
learning enough koha to write reports in whatever their reporting engine is.

Any advice?

cheers
stuart
--
I have a new phone number: 04 463 5692



--
In theory, there is no difference between theory and practice.
In practice, there is. -- Yogi Berra


Re: [CODE4LIB] MARC reporting engine

2014-11-02 Thread Stuart Yeates
Do any of these have built-in indexing? 800k records isn't going to fit in 
memory and if building my own MARC indexer is 'relatively straightforward' then 
you're a better coder than I am. 

cheers
stuart

--
I have a new phone number: 04 463 5692


From: Code for Libraries CODE4LIB@LISTSERV.ND.EDU on behalf of Jonathan 
Rochkind rochk...@jhu.edu
Sent: Monday, 3 November 2014 1:24 p.m.
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC reporting engine

If you are, can become, or know, a programmer, that would be relatively 
straightforward in any programming language using the open source MARC 
processing library for that language. (ruby marc, pymarc, perl marc, whatever).

Although you might find more trouble than you expect around authorities, with 
them being less standardized in your corpus than you might like.

From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Stuart Yeates 
[stuart.yea...@vuw.ac.nz]
Sent: Sunday, November 02, 2014 5:48 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] MARC reporting engine

I have ~800,000 MARC records from an indexing service 
(http://natlib.govt.nz/about-us/open-data/innz-metadata CC-BY). I am trying to 
generate:

(a) a list of person authorities (and sundry metadata), sorted by how many 
times they're referenced, in wikimedia syntax

(b) a view of a person authority, with all the records by which they're 
referenced, processed into a wikipedia stub biography

I have established that this is too much data to process in XSLT or multi-line 
regexps in vi. What other MARC engines are there out there?

The two options I'm aware of are learning multi-line processing in sed or 
learning enough koha to write reports in whatever their reporting engine is.

Any advice?

cheers
stuart
--
I have a new phone number: 04 463 5692