Re: [CODE4LIB] MARC field lengths

2013-10-16 Thread Bill Dueber
My guess is that traversing the WEM structure for display of a single
record (e.g., in a librarian's ILS client or what not) will not be a
problem at all, because the volume is so low.  In terms of the OPAC
interface itself, well, there are lots and lots of ways to denormalize the
data (meaning "copy over and inline data whose canonical values are in
their own tables somewhere") for search and display purposes. Heck, lots of
us do this on a smaller and less complicated scale already, as we dump data
into Solr for our public catalogs.

This adds complexity to the system (determining what to denormalize,
determining when some underlying value has changed and knowing what other
elements need updating), but it's the sort of complexity that's been
well-studied and doesn't worry me too much.
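For concreteness, a minimal sketch of that kind of denormalization. Everything
in it (pysolr, the field names, the idea of copying W- and E-level values onto
the M-level document) is illustrative, not anything this thread specifies:

    import pysolr  # assumption: the pysolr client; any Solr client would do

    work = {'preferred_title': 'Moby Dick', 'creator': 'Melville, Herman'}
    expression = {'language': 'eng'}
    manifestation = {'id': 'm1', 'title_statement': 'Moby Dick; or, The whale'}

    # Inline the canonical W- and E-level values onto the M-level document,
    # so search/display never has to walk M-E-W at query time.
    doc = {
        'id': manifestation['id'],
        'title': manifestation['title_statement'],
        'work_title': work['preferred_title'],   # copied over from W
        'author': work['creator'],               # copied over from W
        'language': expression['language'],      # copied over from E
    }

    solr = pysolr.Solr('http://localhost:8983/solr/catalog')
    solr.add([doc])  # must be re-run whenever the underlying W or E changes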

I'm much, *much* more "nerd" than "librarian," and if there's one thing I
wish I could get across to people who swing the other way, it's that
getting the data model right is so very much harder than figuring out how
to process it. Make sure the individual elements are machine-intelligible,
and there are hordes of smart people (both within and outside of the
library world) who will figure out how to efficiently(-enough) store and
retrieve it. And, for the love of god, have someone around who can at least
speak authoritatively about what sorts of things fall into the "hard" and
"easy-peasy" categories in terms of the technology, instead of making
assumptions.




On Wed, Oct 16, 2013 at 6:23 PM, Karen Coyle  wrote:

> Yes, that's my take as well, but I think it's worth quantifying if
> possible. There is the usual trade-off between time and space -- and I'd be
> interested in hearing whether anyone here thinks that there is any concern
> about traversing the WEM structure for each search and display. Does it
> matter if every display of author in a Manifestation has to connect M-E-W?
> Or is that a concern, like space, that is no longer relevant?
>
> kc
>
>
>
> On 10/16/13 12:57 PM, Bill Dueber wrote:
>
>> If anyone out there is really making a case for FRBR based on whether or
>> not it saves a few characters in a database, well, she should give up the
>> library business and go make money off her time machine. Maybe --
>> *maybe* --
>>
>> 15 years ago. But I have to say, I'm sitting on 10m records right now, and
>> would happily figure out how to deal with double or triple the space
>> requirements for added utility. Space is always a consideration, but it's
>> slipped down into about 15th place on my Giant List of Things to Worry
>> About.
>>
>>
>> On Wed, Oct 16, 2013 at 3:49 PM, Karen Coyle  wrote:
>>
>>  On 10/16/13 12:33 PM, Kyle Banerjee wrote:
>>>
>>>  BTW, I don't think 240 is a good substitute as the content is very
 different than in the regular title. That's where you'll find music,
 laws,
 selections, translations and it's totally littered with subfields. The
 70.1
 figure from the stripped 245 is probably closer to the mark

  Yes, you are right, especially for the particular purpose I am looking
>>> at.
>>> Thanks.
>>>
>>>
>>>
>>>  IMO, what you stand to gain in functionality, maintenance, and analysis
 is
 much more interesting than potential space gains/losses.

  Yes, obviously. But there exists an apology for FRBR that says that it
>>> will save cataloger time and will be more efficient in a database. I
>>> think
>>> it's worth taking a look at those assumptions. If there is a way to
>>> measure
>>> functionality, maintenance, etc. then we should measure it, for sure.
>>>
>>> kc
>>>
>>>
>>>
>>>  kyle




 On Wed, Oct 16, 2013 at 12:00 PM, Karen Coyle  wrote:

   Thanks, Roy (and others!)

> It looks like the 245 is including the $c - dang! I should have been
> more
> specific. I'm mainly interested in the title, which is $a $b -- I'm
> looking
> at the gains and losses of bytes should one implement FRBR. As a hedge,
> could I ask what've you got for the 240? that may be closer to reality.
>
> kc
>
>
> On 10/16/13 10:57 AM, Roy Tennant wrote:
>
>   I don't even have to fire it up. That's a statistic that we generate
>
>> quarterly (albeit via Hadoop). Here you go:
>>
>> 100 - 30.3
>> 245 - 103.1
>> 600 - 41
>> 610 - 48.8
>> 611 - 61.4
>> 630 - 40.8
>> 648 - 23.8
>> 650 - 35.1
>> 651 - 39.6
>> 653 - 33.3
>> 654 - 38.1
>> 655 - 22.5
>> 656 - 30.6
>> 657 - 27.4
>> 658 - 30.7
>> 662 - 41.7
>>
>> Roy
>>
>>
>> On Wed, Oct 16, 2013 at 10:38 AM, Sean Hannan 
>> wrote:
>>
>>That sounds like a request for Roy to fire up the ole OCLC Hadoop.
>>
>>  -Sean
>>>
>>>
>>>
>>> On 10/16/13 1:06 PM, "Karen Coyle"  wrote:
>>>
>>>Anybody have data for the average length of specific MARC fields
>>> in
>>> some
>>>
>>>  reasonably representative

Re: [CODE4LIB] MARC field lengths

2013-10-16 Thread Karen Coyle

On 10/16/13 4:22 PM, Kyle Banerjee wrote:


In some ways, FRBR strikes me as the catalogers' answer to the miserable
seven layer OSI model which often confuses rather than clarifies -- largely
because it doesn't reflect reality very well.


Agreed. I am having trouble seeing FRBR as being beneficial, much less 
necessary. However, there is a widespread assumption that FRBR's WEMI 
will be implemented as a four-level, linked set of hierarchical 
entities, rather than treated as a conceptual model (which is what the 
FRBR documentation says it is). If there are reasons to present users with 
works, expressions and manifestations, nothing in that requires a 
physical model that looks like some kind of relational database design. 
Yet that seems to be what many people assume. So I'd like to expose 
that myth, or at least provide a way to discuss it.


kc


kyle


On Wed, Oct 16, 2013 at 3:23 PM, Karen Coyle  wrote:


Yes, that's my take as well, but I think it's worth quantifying if
possible. There is the usual trade-off between time and space -- and I'd be
interested in hearing whether anyone here thinks that there is any concern
about traversing the WEM structure for each search and display. Does it
matter if every display of author in a Manifestation has to connect M-E-W?
Or is that a concern, like space, that is no longer relevant?

kc



On 10/16/13 12:57 PM, Bill Dueber wrote:


If anyone out there is really making a case for FRBR based on whether or
not it saves a few characters in a database, well, she should give up the
library business and go make money off her time machine. Maybe --
*maybe* --
15 years ago. But I have to say, I'm sitting on 10m records right now, and
would happily figure out how to deal with double or triple the space
requirements for added utility. Space is always a consideration, but it's
slipped down into about 15th place on my Giant List of Things to Worry
About.


On Wed, Oct 16, 2013 at 3:49 PM, Karen Coyle  wrote:

  On 10/16/13 12:33 PM, Kyle Banerjee wrote:

  BTW, I don't think 240 is a good substitute as the content is very

different than in the regular title. That's where you'll find music,
laws,
selections, translations and it's totally littered with subfields. The
70.1
figure from the stripped 245 is probably closer to the mark

  Yes, you are right, especially for the particular purpose I am looking

at.
Thanks.



  IMO, what you stand to gain in functionality, maintenance, and analysis

is
much more interesting than potential space gains/losses.

  Yes, obviously. But there exists an apology for FRBR that says that it

will save cataloger time and will be more efficient in a database. I
think
it's worth taking a look at those assumptions. If there is a way to
measure
functionality, maintenance, etc. then we should measure it, for sure.

kc



  kyle




On Wed, Oct 16, 2013 at 12:00 PM, Karen Coyle  wrote:

   Thanks, Roy (and others!)


It looks like the 245 is including the $c - dang! I should have been
more
specific. I'm mainly interested in the title, which is $a $b -- I'm
looking
at the gains and losses of bytes should one implement FRBR. As a hedge,
could I ask what've you got for the 240? that may be closer to reality.

kc


On 10/16/13 10:57 AM, Roy Tennant wrote:

   I don't even have to fire it up. That's a statistic that we generate


quarterly (albeit via Hadoop). Here you go:

100 - 30.3
245 - 103.1
600 - 41
610 - 48.8
611 - 61.4
630 - 40.8
648 - 23.8
650 - 35.1
651 - 39.6
653 - 33.3
654 - 38.1
655 - 22.5
656 - 30.6
657 - 27.4
658 - 30.7
662 - 41.7

Roy


On Wed, Oct 16, 2013 at 10:38 AM, Sean Hannan 
wrote:

That sounds like a request for Roy to fire up the ole OCLC Hadoop.

  -Sean



On 10/16/13 1:06 PM, "Karen Coyle"  wrote:

Anybody have data for the average length of specific MARC fields
in
some

  reasonably representative database? I mainly need 100, 245, 6xx.

Thanks,
kc

--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet

   --


Karen Coyle

kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet


  --

Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet





--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet



--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet


Re: [CODE4LIB] MARC field lengths

2013-10-16 Thread Kyle Banerjee
Depends on how many requests the service has to accommodate. Up to a point,
it's no big deal. After a certain point, servicing lots of calls gets
expensive and bang for the buck is brought into question.

My bigger concern would be getting data encoded/structured consistently.
Even though FRBR has been around for a long time, people spend a lot of
time scratching their heads about really basic stuff (e.g. what level
something belongs on) when dealing with real world use cases. And it's hard
to automate tasks when the people aren't sure what the machine needs to do.

In some ways, FRBR strikes me as the catalogers' answer to the miserable
seven layer OSI model which often confuses rather than clarifies -- largely
because it doesn't reflect reality very well.

kyle


On Wed, Oct 16, 2013 at 3:23 PM, Karen Coyle  wrote:

> Yes, that's my take as well, but I think it's worth quantifying if
> possible. There is the usual trade-off between time and space -- and I'd be
> interested in hearing whether anyone here thinks that there is any concern
> about traversing the WEM structure for each search and display. Does it
> matter if every display of author in a Manifestation has to connect M-E-W?
> Or is that a concern, like space, that is no longer relevant?
>
> kc
>
>
>
> On 10/16/13 12:57 PM, Bill Dueber wrote:
>
>> If anyone out there is really making a case for FRBR based on whether or
>> not it saves a few characters in a database, well, she should give up the
>> library business and go make money off her time machine. Maybe --
>> *maybe* --
>> 15 years ago. But I have to say, I'm sitting on 10m records right now, and
>> would happily figure out how to deal with double or triple the space
>> requirements for added utility. Space is always a consideration, but it's
>> slipped down into about 15th place on my Giant List of Things to Worry
>> About.
>>
>>
>> On Wed, Oct 16, 2013 at 3:49 PM, Karen Coyle  wrote:
>>
>>  On 10/16/13 12:33 PM, Kyle Banerjee wrote:
>>>
>>>  BTW, I don't think 240 is a good substitute as the content is very
 different than in the regular title. That's where you'll find music,
 laws,
 selections, translations and it's totally littered with subfields. The
 70.1
 figure from the stripped 245 is probably closer to the mark

  Yes, you are right, especially for the particular purpose I am looking
>>> at.
>>> Thanks.
>>>
>>>
>>>
>>>  IMO, what you stand to gain in functionality, maintenance, and analysis
 is
 much more interesting than potential space gains/losses.

  Yes, obviously. But there exists an apology for FRBR that says that it
>>> will save cataloger time and will be more efficient in a database. I
>>> think
>>> it's worth taking a look at those assumptions. If there is a way to
>>> measure
>>> functionality, maintenance, etc. then we should measure it, for sure.
>>>
>>> kc
>>>
>>>
>>>
>>>  kyle




 On Wed, Oct 16, 2013 at 12:00 PM, Karen Coyle  wrote:

   Thanks, Roy (and others!)

> It looks like the 245 is including the $c - dang! I should have been
> more
> specific. I'm mainly interested in the title, which is $a $b -- I'm
> looking
> at the gains and losses of bytes should one implement FRBR. As a hedge,
> could I ask what've you got for the 240? that may be closer to reality.
>
> kc
>
>
> On 10/16/13 10:57 AM, Roy Tennant wrote:
>
>   I don't even have to fire it up. That's a statistic that we generate
>
>> quarterly (albeit via Hadoop). Here you go:
>>
>> 100 - 30.3
>> 245 - 103.1
>> 600 - 41
>> 610 - 48.8
>> 611 - 61.4
>> 630 - 40.8
>> 648 - 23.8
>> 650 - 35.1
>> 651 - 39.6
>> 653 - 33.3
>> 654 - 38.1
>> 655 - 22.5
>> 656 - 30.6
>> 657 - 27.4
>> 658 - 30.7
>> 662 - 41.7
>>
>> Roy
>>
>>
>> On Wed, Oct 16, 2013 at 10:38 AM, Sean Hannan 
>> wrote:
>>
>>That sounds like a request for Roy to fire up the ole OCLC Hadoop.
>>
>>  -Sean
>>>
>>>
>>>
>>> On 10/16/13 1:06 PM, "Karen Coyle"  wrote:
>>>
>>>Anybody have data for the average length of specific MARC fields
>>> in
>>> some
>>>
>>>  reasonably representative database? I mainly need 100, 245, 6xx.

 Thanks,
 kc

 --
 Karen Coyle
 kco...@kcoyle.net http://kcoyle.net
 m: 1-510-435-8234
 skype: kcoylenet

   --

>>> Karen Coyle
> kco...@kcoyle.net http://kcoyle.net
> m: 1-510-435-8234
> skype: kcoylenet
>
>
>  --
>>> Karen Coyle
>>> kco...@kcoyle.net http://kcoyle.net
>>> m: 1-510-435-8234
>>> skype: kcoylenet
>>>
>>>
>>
>>
> --
> Karen Coyle
> kco...@kcoyle.net http://kcoyle.net
> m: 1-510-435-8234
> skype: kcoylenet
>


Re: [CODE4LIB] MARC field lengths

2013-10-16 Thread Karen Coyle
Yes, that's my take as well, but I think it's worth quantifying if 
possible. There is the usual trade-off between time and space -- and I'd 
be interested in hearing whether anyone here thinks that there is any 
concern about traversing the WEM structure for each search and display. 
Does it matter if every display of author in a Manifestation has to 
connect M-E-W? Or is that a concern, like space, that is no longer relevant?
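(Concretely: in a naive relational design, that "connect" is a two-join lookup 
per displayed title. A toy sketch with an invented three-table schema, just to 
make the traversal being asked about visible:)

    import sqlite3

    db = sqlite3.connect(':memory:')
    db.executescript("""
        CREATE TABLE work (id INTEGER PRIMARY KEY, author TEXT);
        CREATE TABLE expression (id INTEGER PRIMARY KEY, work_id INTEGER);
        CREATE TABLE manifestation (id INTEGER PRIMARY KEY,
                                    expression_id INTEGER, title TEXT);
        INSERT INTO work VALUES (1, 'Melville, Herman');
        INSERT INTO expression VALUES (10, 1);
        INSERT INTO manifestation VALUES (100, 10, 'Moby Dick; or, The whale');
    """)

    # Displaying the author for a Manifestation means walking M -> E -> W:
    title, author = db.execute("""
        SELECT m.title, w.author
          FROM manifestation m
          JOIN expression e ON e.id = m.expression_id
          JOIN work w       ON w.id = e.work_id
         WHERE m.id = 100
    """).fetchone()
    print(title, '/', author)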


kc


On 10/16/13 12:57 PM, Bill Dueber wrote:

If anyone out there is really making a case for FRBR based on whether or
not it saves a few characters in a database, well, she should give up the
library business and go make money off her time machine. Maybe -- *maybe* --
15 years ago. But I have to say, I'm sitting on 10m records right now, and
would happily figure out how to deal with double or triple the space
requirements for added utility. Space is always a consideration, but it's
slipped down into about 15th place on my Giant List of Things to Worry
About.


On Wed, Oct 16, 2013 at 3:49 PM, Karen Coyle  wrote:


On 10/16/13 12:33 PM, Kyle Banerjee wrote:


BTW, I don't think 240 is a good substitute as the content is very
different than in the regular title. That's where you'll find music, laws,
selections, translations and it's totally littered with subfields. The
70.1
figure from the stripped 245 is probably closer to the mark


Yes, you are right, especially for the particular purpose I am looking at.
Thanks.




IMO, what you stand to gain in functionality, maintenance, and analysis is
much more interesting than potential space gains/losses.


Yes, obviously. But there exists an apology for FRBR that says that it
will save cataloger time and will be more efficient in a database. I think
it's worth taking a look at those assumptions. If there is a way to measure
functionality, maintenance, etc. then we should measure it, for sure.

kc




kyle




On Wed, Oct 16, 2013 at 12:00 PM, Karen Coyle  wrote:

  Thanks, Roy (and others!)

It looks like the 245 is including the $c - dang! I should have been more
specific. I'm mainly interested in the title, which is $a $b -- I'm
looking
at the gains and losses of bytes should one implement FRBR. As a hedge,
could I ask what've you got for the 240? that may be closer to reality.

kc


On 10/16/13 10:57 AM, Roy Tennant wrote:

  I don't even have to fire it up. That's a statistic that we generate

quarterly (albeit via Hadoop). Here you go:

100 - 30.3
245 - 103.1
600 - 41
610 - 48.8
611 - 61.4
630 - 40.8
648 - 23.8
650 - 35.1
651 - 39.6
653 - 33.3
654 - 38.1
655 - 22.5
656 - 30.6
657 - 27.4
658 - 30.7
662 - 41.7

Roy


On Wed, Oct 16, 2013 at 10:38 AM, Sean Hannan  wrote:

   That sounds like a request for Roy to fire up the ole OCLC Hadoop.


-Sean



On 10/16/13 1:06 PM, "Karen Coyle"  wrote:

   Anybody have data for the average length of specific MARC fields in
some


reasonably representative database? I mainly need 100, 245, 6xx.

Thanks,
kc

--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet

  --

Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet



--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet






--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet


Re: [CODE4LIB] local APIs atop III's Sierra DB

2013-10-16 Thread Joshua Welker
Thought I'd share this work put together by the folks in charge of our
consortium:

https://github.com/mcoia/sierra_marc_tools

It's a Perl implementation. I haven't used it myself, but I know it can
generate MARC records.

Josh Welker

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
Rob Casson
Sent: Wednesday, October 16, 2013 1:05 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] local APIs atop III's Sierra DB

i've done some very ugly, preliminary hacking at getting MARC records out:

https://gist.github.com/roblivian/7012077

generally "works", but still need to account for more invalid MARC tags,
"on-the-fly" records (non-MARC records, i.e. reserve items, ordered bibs,
etc)



On Wed, Oct 16, 2013 at 10:49 AM, Thomale, Jason
wrote:

> Everyone: You guys are fantastic. Thanks to those who have responded
> thus far for being so willing to share. I will be contacting y'all
> off-list, if you don't mind. :-)
>
> Just wanted to tag onto Dave's response here...
>
> > I've written a decent amount of code against Sierra, but I don't
> > know if any of it amounts to an "API".
> >
> ...
> > * I've also started creating little web services with mod_perl for
> > use in a web-application I'm working on.  Examples: a script that
> > spits back item information in JSON when given an item barcode, a
> > script that spits back a JSON list of all attached items when given
> > a bib record number.  Again these are mostly special purpose, but I
> > have a notion to find ways to generalize them.
>
> Yes this is basically where I am right now and where this is coming
from.
> I've thrown together sort of a prototype app for helping us with some
> inventory stuff we're doing, which consists of a really
> quick-and-dirty web service that serves up JSON and a bootstrap/jQuery
> front-end. For what it is--which at this point isn't much more than a
proof-of-concept--it works.
> But. In the coming year there are a lot of similar things we plan to
> do, and building out a RESTful API to serve up catalog data in
> particular ways seems like a logical step right now.
>
> Julia alluded to "some things you don't want to do when you're
> querying the database," which is something I'm interested in talking
about as well.
> If my experiences are anything like yours, Julia, I'm finding things
> just aren't indexed in ways that make it optimal for our use cases.
> Namely, querying on most variable field data is out of the question if
> you don't want multi-minute response times. It seems the only way to
> get this to work well will be to dump portions of the database out to
> an external document store / indexer. I'm primarily looking at serving
> up JSON at this point, so probably something like Solr or
> Elasticsearch. Learning from your experiences building a Sierra driver
> for VuFind would be quite helpful and interesting.
>
> Francis, I'll be interested to see whether you're thinking along
> similar lines or if you're going a totally different direction...
>
> > Sadly, I'm a team of one here and I'm a bit shy about the state my
> > code is currently in, so I haven't published it anywhere.  ( Also
> > the way I use git locally is probably "wrong", not to mention there
> > are probably passwords in old commits. )
>
> No worries! I completely understand, and I share your shyness. Believe
> me, I'm the last person that should judge.
>
> > Nonetheless, I'd definitely be interested in collaborating on
> > anything that might benefit all Sierra users.
>
> Cool. I really appreciate it. I guess--at this point I'm still looking
> at solving local needs first, but making it easy enough to extend to
> new use cases. Or...at the very least doing something that will
> provide for a good learning experience. :-) I don't know, it's still
ideas.
>
> Thanks,
>
> Jason
>


Re: [CODE4LIB] MARC field lengths

2013-10-16 Thread Bill Dueber
If anyone out there is really making a case for FRBR based on whether or
not it saves a few characters in a database, well, she should give up the
library business and go make money off her time machine. Maybe -- *maybe* --
15 years ago. But I have to say, I'm sitting on 10m records right now, and
would happily figure out how to deal with double or triple the space
requirements for added utility. Space is always a consideration, but it's
slipped down into about 15th place on my Giant List of Things to Worry
About.


On Wed, Oct 16, 2013 at 3:49 PM, Karen Coyle  wrote:

> On 10/16/13 12:33 PM, Kyle Banerjee wrote:
>
>> BTW, I don't think 240 is a good substitute as the content is very
>> different than in the regular title. That's where you'll find music, laws,
>> selections, translations and it's totally littered with subfields. The
>> 70.1
>> figure from the stripped 245 is probably closer to the mark
>>
>
> Yes, you are right, especially for the particular purpose I am looking at.
> Thanks.
>
>
>
>> IMO, what you stand to gain in functionality, maintenance, and analysis is
>> much more interesting than potential space gains/losses.
>>
>
> Yes, obviously. But there exists an apology for FRBR that says that it
> will save cataloger time and will be more efficient in a database. I think
> it's worth taking a look at those assumptions. If there is a way to measure
> functionality, maintenance, etc. then we should measure it, for sure.
>
> kc
>
>
>
>> kyle
>>
>>
>>
>>
>> On Wed, Oct 16, 2013 at 12:00 PM, Karen Coyle  wrote:
>>
>>  Thanks, Roy (and others!)
>>>
>>> It looks like the 245 is including the $c - dang! I should have been more
>>> specific. I'm mainly interested in the title, which is $a $b -- I'm
>>> looking
>>> at the gains and losses of bytes should one implement FRBR. As a hedge,
>>> could I ask what've you got for the 240? that may be closer to reality.
>>>
>>> kc
>>>
>>>
>>> On 10/16/13 10:57 AM, Roy Tennant wrote:
>>>
>>>  I don't even have to fire it up. That's a statistic that we generate
 quarterly (albeit via Hadoop). Here you go:

 100 - 30.3
 245 - 103.1
 600 - 41
 610 - 48.8
 611 - 61.4
 630 - 40.8
 648 - 23.8
 650 - 35.1
 651 - 39.6
 653 - 33.3
 654 - 38.1
 655 - 22.5
 656 - 30.6
 657 - 27.4
 658 - 30.7
 662 - 41.7

 Roy


 On Wed, Oct 16, 2013 at 10:38 AM, Sean Hannan  wrote:

   That sounds like a request for Roy to fire up the ole OCLC Hadoop.

> -Sean
>
>
>
> On 10/16/13 1:06 PM, "Karen Coyle"  wrote:
>
>   Anybody have data for the average length of specific MARC fields in
> some
>
>> reasonably representative database? I mainly need 100, 245, 6xx.
>>
>> Thanks,
>> kc
>>
>> --
>> Karen Coyle
>> kco...@kcoyle.net http://kcoyle.net
>> m: 1-510-435-8234
>> skype: kcoylenet
>>
>>  --
>>> Karen Coyle
>>> kco...@kcoyle.net http://kcoyle.net
>>> m: 1-510-435-8234
>>> skype: kcoylenet
>>>
>>>
> --
> Karen Coyle
> kco...@kcoyle.net http://kcoyle.net
> m: 1-510-435-8234
> skype: kcoylenet
>



-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] MARC field lengths

2013-10-16 Thread Bill Dueber
For the HathiTrust catalog's 6,046,746 bibs and looking at only the lengths
of the subfields $a and $b in 245s, I get an average length of 62.0.


On Wed, Oct 16, 2013 at 3:24 PM, Kyle Banerjee wrote:

> 245 not including $c, indicators, or delimiters, |h (which occurs before
> |b), |n, |p, with trailing slash preceding |c stripped for about 9 million
> records for Orbis Cascade collections is 70.1
>
> kyle
>
>
> On Wed, Oct 16, 2013 at 12:00 PM, Karen Coyle  wrote:
>
> > Thanks, Roy (and others!)
> >
> > It looks like the 245 is including the $c - dang! I should have been more
> > specific. I'm mainly interested in the title, which is $a $b -- I'm
> looking
> > at the gains and losses of bytes should one implement FRBR. As a hedge,
> > could I ask what've you got for the 240? that may be closer to reality.
> >
> > kc
> >
> >
> > On 10/16/13 10:57 AM, Roy Tennant wrote:
> >
> >> I don't even have to fire it up. That's a statistic that we generate
> >> quarterly (albeit via Hadoop). Here you go:
> >>
> >> 100 - 30.3
> >> 245 - 103.1
> >> 600 - 41
> >> 610 - 48.8
> >> 611 - 61.4
> >> 630 - 40.8
> >> 648 - 23.8
> >> 650 - 35.1
> >> 651 - 39.6
> >> 653 - 33.3
> >> 654 - 38.1
> >> 655 - 22.5
> >> 656 - 30.6
> >> 657 - 27.4
> >> 658 - 30.7
> >> 662 - 41.7
> >>
> >> Roy
> >>
> >>
> >> On Wed, Oct 16, 2013 at 10:38 AM, Sean Hannan  wrote:
> >>
> >>  That sounds like a request for Roy to fire up the ole OCLC Hadoop.
> >>>
> >>> -Sean
> >>>
> >>>
> >>>
> >>> On 10/16/13 1:06 PM, "Karen Coyle"  wrote:
> >>>
> >>>  Anybody have data for the average length of specific MARC fields in
> some
>  reasonably representative database? I mainly need 100, 245, 6xx.
> 
>  Thanks,
>  kc
> 
>  --
>  Karen Coyle
>  kco...@kcoyle.net http://kcoyle.net
>  m: 1-510-435-8234
>  skype: kcoylenet
> 
> >>>
> > --
> > Karen Coyle
> > kco...@kcoyle.net http://kcoyle.net
> > m: 1-510-435-8234
> > skype: kcoylenet
> >
>



-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] MARC field lengths

2013-10-16 Thread Karen Coyle

On 10/16/13 12:33 PM, Kyle Banerjee wrote:

BTW, I don't think 240 is a good substitute as the content is very
different than in the regular title. That's where you'll find music, laws,
selections, translations and it's totally littered with subfields. The 70.1
figure from the stripped 245 is probably closer to the mark


Yes, you are right, especially for the particular purpose I am looking 
at. Thanks.




IMO, what you stand to gain in functionality, maintenance, and analysis is
much more interesting than potential space gains/losses.


Yes, obviously. But there exists an apology for FRBR that says that it 
will save cataloger time and will be more efficient in a database. I 
think it's worth taking a look at those assumptions. If there is a way 
to measure functionality, maintenance, etc. then we should measure it, 
for sure.


kc



kyle




On Wed, Oct 16, 2013 at 12:00 PM, Karen Coyle  wrote:


Thanks, Roy (and others!)

It looks like the 245 is including the $c - dang! I should have been more
specific. I'm mainly interested in the title, which is $a $b -- I'm looking
at the gains and losses of bytes should one implement FRBR. As a hedge,
could I ask what've you got for the 240? that may be closer to reality.

kc


On 10/16/13 10:57 AM, Roy Tennant wrote:


I don't even have to fire it up. That's a statistic that we generate
quarterly (albeit via Hadoop). Here you go:

100 - 30.3
245 - 103.1
600 - 41
610 - 48.8
611 - 61.4
630 - 40.8
648 - 23.8
650 - 35.1
651 - 39.6
653 - 33.3
654 - 38.1
655 - 22.5
656 - 30.6
657 - 27.4
658 - 30.7
662 - 41.7

Roy


On Wed, Oct 16, 2013 at 10:38 AM, Sean Hannan  wrote:

  That sounds like a request for Roy to fire up the ole OCLC Hadoop.

-Sean



On 10/16/13 1:06 PM, "Karen Coyle"  wrote:

  Anybody have data for the average length of specific MARC fields in some

reasonably representative database? I mainly need 100, 245, 6xx.

Thanks,
kc

--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet


--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet



--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet


Re: [CODE4LIB] MARC field lengths

2013-10-16 Thread Kyle Banerjee
BTW, I don't think 240 is a good substitute as the content is very
different than in the regular title. That's where you'll find music, laws,
selections, translations and it's totally littered with subfields. The 70.1
figure from the stripped 245 is probably closer to the mark

IMO, what you stand to gain in functionality, maintenance, and analysis is
much more interesting than potential space gains/losses.

kyle




On Wed, Oct 16, 2013 at 12:00 PM, Karen Coyle  wrote:

> Thanks, Roy (and others!)
>
> It looks like the 245 is including the $c - dang! I should have been more
> specific. I'm mainly interested in the title, which is $a $b -- I'm looking
> at the gains and losses of bytes should one implement FRBR. As a hedge,
> could I ask what've you got for the 240? that may be closer to reality.
>
> kc
>
>
> On 10/16/13 10:57 AM, Roy Tennant wrote:
>
>> I don't even have to fire it up. That's a statistic that we generate
>> quarterly (albeit via Hadoop). Here you go:
>>
>> 100 - 30.3
>> 245 - 103.1
>> 600 - 41
>> 610 - 48.8
>> 611 - 61.4
>> 630 - 40.8
>> 648 - 23.8
>> 650 - 35.1
>> 651 - 39.6
>> 653 - 33.3
>> 654 - 38.1
>> 655 - 22.5
>> 656 - 30.6
>> 657 - 27.4
>> 658 - 30.7
>> 662 - 41.7
>>
>> Roy
>>
>>
>> On Wed, Oct 16, 2013 at 10:38 AM, Sean Hannan  wrote:
>>
>>  That sounds like a request for Roy to fire up the ole OCLC Hadoop.
>>>
>>> -Sean
>>>
>>>
>>>
>>> On 10/16/13 1:06 PM, "Karen Coyle"  wrote:
>>>
>>>  Anybody have data for the average length of specific MARC fields in some
 reasonably representative database? I mainly need 100, 245, 6xx.

 Thanks,
 kc

 --
 Karen Coyle
 kco...@kcoyle.net http://kcoyle.net
 m: 1-510-435-8234
 skype: kcoylenet

>>>
> --
> Karen Coyle
> kco...@kcoyle.net http://kcoyle.net
> m: 1-510-435-8234
> skype: kcoylenet
>


Re: [CODE4LIB] MARC field lengths

2013-10-16 Thread Nicolas Franck
Are you familiar with the OAI-PMH protocol? We have almost 2 million records
available over this protocol:

http://search.ugent.be/meercat/x/oai?verb=ListRecords&metadataPrefix=marcxml
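For anyone who wants to pull those records down, a minimal harvesting sketch
(the Sickle OAI-PMH client here is my assumption; any harvester that follows
resumption tokens would do):

    from sickle import Sickle  # assumption: the Sickle OAI-PMH client

    sickle = Sickle('http://search.ugent.be/meercat/x/oai')
    # Sickle follows OAI-PMH resumption tokens for you, so this iterates
    # over the full record set, one MARCXML record at a time.
    for record in sickle.ListRecords(metadataPrefix='marcxml'):
        print(record.header.identifier)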


From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Karen Coyle 
[li...@kcoyle.net]
Sent: Wednesday, October 16, 2013 7:06 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] MARC field lengths

Anybody have data for the average length of specific MARC fields in some
reasonably representative database? I mainly need 100, 245, 6xx.

Thanks,
kc

--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet


Re: [CODE4LIB] MARC field lengths

2013-10-16 Thread Kyle Banerjee
245 not including $c, indicators, or delimiters, |h (which occurs before
|b), |n, |p, with trailing slash preceding |c stripped for about 9 million
records for Orbis Cascade collections is 70.1
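A sketch of that computation as I read Kyle's rules (pymarc and the exact
order of stripping operations are my assumptions; his actual tooling isn't
stated):

    import re
    from pymarc import MARCReader

    def stripped_245_len(field):
        # Keep only $a and $b (so no $c, $h, $n, $p, indicators, or
        # delimiters), then drop the trailing ISBD slash that precedes $c.
        text = ' '.join(field.get_subfields('a', 'b'))
        return len(re.sub(r'\s*/\s*$', '', text))

    total = count = 0
    with open('records.mrc', 'rb') as fh:   # hypothetical file name
        for record in MARCReader(fh):
            if record is None:              # skip unparseable records
                continue
            fields = record.get_fields('245')
            if fields:
                total += stripped_245_len(fields[0])
                count += 1
    print(total / count)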

kyle


On Wed, Oct 16, 2013 at 12:00 PM, Karen Coyle  wrote:

> Thanks, Roy (and others!)
>
> It looks like the 245 is including the $c - dang! I should have been more
> specific. I'm mainly interested in the title, which is $a $b -- I'm looking
> at the gains and losses of bytes should one implement FRBR. As a hedge,
> could I ask what've you got for the 240? that may be closer to reality.
>
> kc
>
>
> On 10/16/13 10:57 AM, Roy Tennant wrote:
>
>> I don't even have to fire it up. That's a statistic that we generate
>> quarterly (albeit via Hadoop). Here you go:
>>
>> 100 - 30.3
>> 245 - 103.1
>> 600 - 41
>> 610 - 48.8
>> 611 - 61.4
>> 630 - 40.8
>> 648 - 23.8
>> 650 - 35.1
>> 651 - 39.6
>> 653 - 33.3
>> 654 - 38.1
>> 655 - 22.5
>> 656 - 30.6
>> 657 - 27.4
>> 658 - 30.7
>> 662 - 41.7
>>
>> Roy
>>
>>
>> On Wed, Oct 16, 2013 at 10:38 AM, Sean Hannan  wrote:
>>
>>  That sounds like a request for Roy to fire up the ole OCLC Hadoop.
>>>
>>> -Sean
>>>
>>>
>>>
>>> On 10/16/13 1:06 PM, "Karen Coyle"  wrote:
>>>
>>>  Anybody have data for the average length of specific MARC fields in some
 reasonably representative database? I mainly need 100, 245, 6xx.

 Thanks,
 kc

 --
 Karen Coyle
 kco...@kcoyle.net http://kcoyle.net
 m: 1-510-435-8234
 skype: kcoylenet

>>>
> --
> Karen Coyle
> kco...@kcoyle.net http://kcoyle.net
> m: 1-510-435-8234
> skype: kcoylenet
>


Re: [CODE4LIB] MARC field lengths

2013-10-16 Thread Karen Coyle

Thanks, Roy (and others!)

It looks like the 245 is including the $c - dang! I should have been 
more specific. I'm mainly interested in the title, which is $a $b -- I'm 
looking at the gains and losses of bytes should one implement FRBR. As a 
hedge, could I ask what've you got for the 240? that may be closer to 
reality.


kc

On 10/16/13 10:57 AM, Roy Tennant wrote:

I don't even have to fire it up. That's a statistic that we generate
quarterly (albeit via Hadoop). Here you go:

100 - 30.3
245 - 103.1
600 - 41
610 - 48.8
611 - 61.4
630 - 40.8
648 - 23.8
650 - 35.1
651 - 39.6
653 - 33.3
654 - 38.1
655 - 22.5
656 - 30.6
657 - 27.4
658 - 30.7
662 - 41.7

Roy


On Wed, Oct 16, 2013 at 10:38 AM, Sean Hannan  wrote:


That sounds like a request for Roy to fire up the ole OCLC Hadoop.

-Sean



On 10/16/13 1:06 PM, "Karen Coyle"  wrote:


Anybody have data for the average length of specific MARC fields in some
reasonably representative database? I mainly need 100, 245, 6xx.

Thanks,
kc

--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet


--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet


Re: [CODE4LIB] Tool for feedback on document

2013-10-16 Thread McCanna, Terran
I've used http://a.nnotate.com/ for this several times. You can leave comments 
in line with the text, respond to other comments, display/print the comments in 
different ways, and one of my favorite things is that the people you send the 
link to don't have to create an account. 


Terran McCanna 
PINES Program Manager 
Georgia Public Library Service 
1800 Century Place, Suite 150 
Atlanta, GA 30345 
404-235-7138 
tmcca...@georgialibraries.org 

- Original Message -
From: "Ken Varnum" 
To: CODE4LIB@LISTSERV.ND.EDU
Sent: Wednesday, October 16, 2013 2:23:51 PM
Subject: Re: [CODE4LIB] Tool for feedback on document

Commentpress and digress.it are two Wordpress variants that offer
paragraph-by-paragraph threaded commenting. Commentpress is quite old (we
used it here: http://www.lib.umich.edu/islamic/ in a collaborative
cataloging project sponsored by CLIR and funded by Mellon).


--
Ken Varnum | Web Systems Manager | MLibrary - University of Michigan - Ann
Arbor
var...@umich.edu | @varnum | http://www.lib.umich.edu/users/varnum |
734-615-3287


On Wed, Oct 16, 2013 at 2:12 PM, Michael J. Giarlo <
leftw...@alumni.rutgers.edu> wrote:

> Hi David,
>
> Google Drive (née Docs) will allow you to share your document with other
> users so that they can view and comment (and not edit), FWIW.  There may be
> more elegant solutions that allow, say, nested/threaded comments.  I know
> there is blog software out there that does this, but it's been a few years
> so I forget what it's called.
>
> -Mike
> 
>
>
> On Wed, Oct 16, 2013 at 11:06 AM, Walker, David  >wrote:
>
> > Hi all,
> >
> > We're looking to put together a large policy document, and would like to
> > be able to solicit feedback on the text from librarians and staff across
> > two dozen institutions.
> >
> > We could just do that via email, of course.  But I thought it might be
> > better to have something web-based.  A wiki is not the best solution
> here,
> > as I don't want those providing feedback to be able to change the text
> > itself, but rather just leave comments.
> >
> > My fallback plan is to just use Wordpress, breaking the document up into
> > various pages or posts, which people can then comment on.  But it seems
> to
> > me there must be a better solution here -- maybe one where people can
> > leave comments in line with the text?
> >
> > Any suggestions?
> >
> > Thanks,
> >
> > --Dave
> >
> > -
> > David Walker
> > Director, Systemwide Digital Library Services
> > California State University
> > 562-355-4845
> >
>


Re: [CODE4LIB] Tool for feedback on document

2013-10-16 Thread Erik Hetzner
At Wed, 16 Oct 2013 11:06:02 -0700,
Walker, David wrote:
> 
> Hi all,
> 
> We're looking to put together a large policy document, and would
> like to be able to solicit feedback on the text from librarians and
> staff across two dozen institutions.
> 
> We could just do that via email, of course. But I thought it might
> be better to have something web-based. A wiki is not the best
> solution here, as I don't want those providing feedback to be able
> to change the text itself, but rather just leave comments.
> 
> My fallback plan is to just use Wordpress, breaking the document up
> into various pages or posts, which people can then comment on. But
> it seems to me there must be a better solution here -- maybe one
> where people can leave comments in line with the text?

Hi David,

For the GPLv3 process, the Free Software Foundation developed a web
application named stet for annotating and commenting on a text.
Apparently the successor to that is considered co-ment [1] which has a
gratis “lite” version [2]. That might solve your need. I’ve never
tried it.

best, Erik

1. http://www.co-ment.com/
2. https://lite.co-ment.com/
Sent from my free software system.


Re: [CODE4LIB] Tool for feedback on document

2013-10-16 Thread Mark A. Matienzo
Hi David,

In the past, I've used Digress.it with WordPress for
this - I've set this up for the Society of American Archivists Reappraisal
and Deaccessioning Development and Review Team.

Mark

--
Mark A. Matienzo 
Digital Archivist, Manuscripts and Archives, Yale University Library
Technical Architect, ArchivesSpace


On Wed, Oct 16, 2013 at 2:12 PM, Michael J. Giarlo <
leftw...@alumni.rutgers.edu> wrote:

> Hi David,
>
> Google Drive (née Docs) will allow you to share your document with other
> users so that they can view and comment (and not edit), FWIW.  There may be
> more elegant solutions that allow, say, nested/threaded comments.  I know
> there is blog software out there that does this, but it's been a few years
> so I forget what it's called.
>
> -Mike
>
>
>
> On Wed, Oct 16, 2013 at 11:06 AM, Walker, David  >wrote:
>
> > Hi all,
> >
> > We're looking to put together a large policy document, and would like to
> > be able to solicit feedback on the text from librarians and staff across
> > two dozen institutions.
> >
> > We could just do that via email, of course.  But I thought it might be
> > better to have something web-based.  A wiki is not the best solution
> here,
> > as I don't want those providing feedback to be able to change the text
> > itself, but rather just leave comments.
> >
> > My fallback plan is to just use Wordpress, breaking the document up into
> > various pages or posts, which people can then comment on.  But it seems
> to
> > me there must be a better solution here -- maybe one where people can
> > leave comments in line with the text?
> >
> > Any suggestions?
> >
> > Thanks,
> >
> > --Dave
> >
> > -
> > David Walker
> > Director, Systemwide Digital Library Services
> > California State University
> > 562-355-4845
> >
>


Re: [CODE4LIB] Tool for feedback on document

2013-10-16 Thread Ken Varnum
Commentpress and digress.it are two Wordpress variants that offer
paragraph-by-paragraph threaded commenting. Commentpress is quite old (we
used it here: http://www.lib.umich.edu/islamic/ in a collaborative
cataloging project sponsored by CLIR and funded by Mellon).


--
Ken Varnum | Web Systems Manager | MLibrary - University of Michigan - Ann
Arbor
var...@umich.edu | @varnum | http://www.lib.umich.edu/users/varnum |
734-615-3287


On Wed, Oct 16, 2013 at 2:12 PM, Michael J. Giarlo <
leftw...@alumni.rutgers.edu> wrote:

> Hi David,
>
> Google Drive (née Docs) will allow you to share your document with other
> users so that they can view and comment (and not edit), FWIW.  There may be
> more elegant solutions that allow, say, nested/threaded comments.  I know
> there is blog software out there that does this, but it's been a few years
> so I forget what it's called.
>
> -Mike
> 
>
>
> On Wed, Oct 16, 2013 at 11:06 AM, Walker, David  >wrote:
>
> > Hi all,
> >
> > We're looking to put together a large policy document, and would like to
> > be able to solicit feedback on the text from librarians and staff across
> > two dozen institutions.
> >
> > We could just do that via email, of course.  But I thought it might be
> > better to have something web-based.  A wiki is not the best solution
> here,
> > as I don't want those providing feedback to be able to change the text
> > itself, but rather just leave comments.
> >
> > My fallback plan is to just use Wordpress, breaking the document up into
> > various pages or posts, which people can then comment on.  But it seems
> to
> > me there must be a better solution here -- maybe one where people can
> > leave comments in line with the text?
> >
> > Any suggestions?
> >
> > Thanks,
> >
> > --Dave
> >
> > -
> > David Walker
> > Director, Systemwide Digital Library Services
> > California State University
> > 562-355-4845
> >
>


Re: [CODE4LIB] Tool for feedback on document

2013-10-16 Thread Michael J. Giarlo
Hi David,

Google Drive (née Docs) will allow you to share your document with other
users so that they can view and comment (and not edit), FWIW.  There may be
more elegant solutions that allow, say, nested/threaded comments.  I know
there is blog software out there that does this, but it's been a few years
so I forget what it's called.

-Mike


On Wed, Oct 16, 2013 at 11:06 AM, Walker, David wrote:

> Hi all,
>
> We're looking to put together a large policy document, and would like to
> be able to solicit feedback on the text from librarians and staff across
> two dozen institutions.
>
> We could just do that via email, of course.  But I thought it might be
> better to have something web-based.  A wiki is not the best solution here,
> as I don't want those providing feedback to be able to change the text
> itself, but rather just leave comments.
>
> My fallback plan is to just use Wordpress, breaking the document up into
> various pages or posts, which people can then comment on.  But it seems to
> me there must be a better solution here -- maybe one where people can
> leave comments in line with the text?
>
> Any suggestions?
>
> Thanks,
>
> --Dave
>
> -
> David Walker
> Director, Systemwide Digital Library Services
> California State University
> 562-355-4845
>


Re: [CODE4LIB] MARC field lengths

2013-10-16 Thread Kyle Banerjee
Argh. Must learn to write at third grade level

I wanted to say I like breaking up 6XX as Roy has done because 6XX fields
vary in purpose and tag frequency varies considerably.


On Wed, Oct 16, 2013 at 11:08 AM, Kyle Banerjee wrote:

> This squares with what I'm seeing. Data for all holdings of the Orbis
> Cascade Alliance is:
>
> 100: 30.1
> 245: 114.1
> 6XX: 36.1
>
> My values include indicators (2 characters) as well as any delimiters but
> not the tag number itself. I breaking up 6XX up as Roy has as 6XX's are far
> from created equal and frequency of occurrence varies radically with tag.
>
> I'm going to guess our 245 values are longer because we're an academic
> consortium and holdings are biased towards academic titles which tend to be
> longer.
>


Re: [CODE4LIB] MARC field lengths

2013-10-16 Thread Kyle Banerjee
This squares with what I'm seeing. Data for all holdings of the Orbis
Cascade Alliance is:

100: 30.1
245: 114.1
6XX: 36.1

My values include indicators (2 characters) as well as any delimiters but
not the tag number itself. I breaking up 6XX up as Roy has as 6XX's are far
from created equal and frequency of occurrence varies radically with tag.

I'm going to guess our 245 values are longer because we're an academic
consortium and holdings are biased towards academic titles which tend to be
longer.

kyle


On Wed, Oct 16, 2013 at 10:57 AM, Roy Tennant  wrote:

> I don't even have to fire it up. That's a statistic that we generate
> quarterly (albeit via Hadoop). Here you go:
>
> 100 - 30.3
> 245 - 103.1
> 600 - 41
> 610 - 48.8
> 611 - 61.4
> 630 - 40.8
> 648 - 23.8
> 650 - 35.1
> 651 - 39.6
> 653 - 33.3
> 654 - 38.1
> 655 - 22.5
> 656 - 30.6
> 657 - 27.4
> 658 - 30.7
> 662 - 41.7
>
> Roy
>
>
> On Wed, Oct 16, 2013 at 10:38 AM, Sean Hannan  wrote:
>
> > That sounds like a request for Roy to fire up the ole OCLC Hadoop.
> >
> > -Sean
> >
> >
> >
> > On 10/16/13 1:06 PM, "Karen Coyle"  wrote:
> >
> > >Anybody have data for the average length of specific MARC fields in some
> > >reasonably representative database? I mainly need 100, 245, 6xx.
> > >
> > >Thanks,
> > >kc
> > >
> > >--
> > >Karen Coyle
> > >kco...@kcoyle.net http://kcoyle.net
> > >m: 1-510-435-8234
> > >skype: kcoylenet
> >
>


[CODE4LIB] Tool for feedback on document

2013-10-16 Thread Walker, David
Hi all,

We're looking to put together a large policy document, and would like to be 
able to solicit feedback on the text from librarians and staff across two dozen 
institutions.

We could just do that via email, of course.  But I thought it might be better 
to have something web-based.  A wiki is not the best solution here, as I don't 
want those providing feedback to be able to change the text itself, but rather 
just leave comments.

My fallback plan is to just use Wordpress, breaking the document up into 
various pages or posts, which people can then comment on.  But it seems to me 
there must be a better solution here -- maybe one where people can leave 
comments in line with the text?

Any suggestions?

Thanks,

--Dave 

-
David Walker
Director, Systemwide Digital Library Services
California State University
562-355-4845


Re: [CODE4LIB] local APIs atop III's Sierra DB

2013-10-16 Thread Rob Casson
i've done some very ugly, preliminary hacking at getting MARC records out:

https://gist.github.com/roblivian/7012077

generally "works", but still need to account for more invalid MARC tags,
"on-the-fly" records (non-MARC records, i.e. reserve items, ordered bibs,
etc)



On Wed, Oct 16, 2013 at 10:49 AM, Thomale, Jason wrote:

> Everyone: You guys are fantastic. Thanks to those who have responded thus
> far for being so willing to share. I will be contacting y'all off-list, if
> you don't mind. :-)
>
> Just wanted to tag onto Dave's response here...
>
> > I've written a decent amount of code against Sierra, but I don't know if
> > any of it amounts to an "API".
> >
> ...
> > * I've also started creating little web services with mod_perl for use
> > in a
> > web-application I'm working on.  Examples: a script that spits back item
> > information in JSON when given an item barcode, a script that spits back
> > a
> > JSON list of all attached items when given a bib record number.  Again
> > these are mostly special purpose, but I have a notion to find ways to
> > generalize them.
>
> Yes this is basically where I am right now and where this is coming from.
> I've thrown together sort of a prototype app for helping us with some
> inventory stuff we're doing, which consists of a really quick-and-dirty web
> service that serves up JSON and a bootstrap/jQuery front-end. For what it
> is--which at this point isn't much more than a proof-of-concept--it works.
> But. In the coming year there are a lot of similar things we plan to do,
> and building out a RESTful API to serve up catalog data in particular ways
> seems like a logical step right now.
>
> Julia alluded to "some things you don't want to do when you're querying
> the database," which is something I'm interested in talking about as well.
> If my experiences are anything like yours, Julia, I'm finding things just
> aren't indexed in ways that make it optimal for our use cases. Namely,
> querying on most variable field data is out of the question if you don't
> want multi-minute response times. It seems the only way to get this to work
> well will be to dump portions of the database out to an external document
> store / indexer. I'm primarily looking at serving up JSON at this point, so
> probably something like Solr or Elasticsearch. Learning from your
> experiences building a Sierra driver for VuFind would be quite helpful and
> interesting.
>
> Francis, I'll be interested to see whether you're thinking along similar
> lines or if you're going a totally different direction...
>
> > Sadly, I'm a team of one here and I'm a bit shy about the state my code
> > is
> > currently in, so I haven't published it anywhere.  ( Also the way I use
> > git
> > locally is probably "wrong", not to mention there are probably passwords
> > in
> > old commits. )
>
> No worries! I completely understand, and I share your shyness. Believe me,
> I'm the last person that should judge.
>
> > Nonetheless, I'd definitely be interested in collaborating on anything
> > that
> > might benefit all Sierra users.
>
> Cool. I really appreciate it. I guess--at this point I'm still looking at
> solving local needs first, but making it easy enough to extend to new use
> cases. Or...at the very least doing something that will provide for a good
> learning experience. :-) I don't know, it's still ideas.
>
> Thanks,
>
> Jason
>


Re: [CODE4LIB] MARC field lengths

2013-10-16 Thread Roy Tennant
I don't even have to fire it up. That's a statistic that we generate
quarterly (albeit via Hadoop). Here you go:

100 - 30.3
245 - 103.1
600 - 41
610 - 48.8
611 - 61.4
630 - 40.8
648 - 23.8
650 - 35.1
651 - 39.6
653 - 33.3
654 - 38.1
655 - 22.5
656 - 30.6
657 - 27.4
658 - 30.7
662 - 41.7

Roy
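(At smaller scale no Hadoop is needed; a local per-tag computation along these
lines would do. pymarc is an assumption, and field.value() counts the
concatenated subfield data without indicators, which is only one of several
defensible counting choices -- part of why different people's numbers differ:)

    from collections import defaultdict
    from pymarc import MARCReader

    sums, counts = defaultdict(int), defaultdict(int)
    with open('records.mrc', 'rb') as fh:          # hypothetical file name
        for record in MARCReader(fh):
            if record is None:                     # skip unparseable records
                continue
            for field in record.get_fields():      # all fields in the record
                if field.tag in ('100', '245') or field.tag.startswith('6'):
                    sums[field.tag] += len(field.value())
                    counts[field.tag] += 1
    for tag in sorted(sums):
        print('%s - %.1f' % (tag, sums[tag] / counts[tag]))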


On Wed, Oct 16, 2013 at 10:38 AM, Sean Hannan  wrote:

> That sounds like a request for Roy to fire up the ole OCLC Hadoop.
>
> -Sean
>
>
>
> On 10/16/13 1:06 PM, "Karen Coyle"  wrote:
>
> >Anybody have data for the average length of specific MARC fields in some
> >reasonably representative database? I mainly need 100, 245, 6xx.
> >
> >Thanks,
> >kc
> >
> >--
> >Karen Coyle
> >kco...@kcoyle.net http://kcoyle.net
> >m: 1-510-435-8234
> >skype: kcoylenet
>


Re: [CODE4LIB] MARC field lengths

2013-10-16 Thread Bill Dueber
I'm running it against the HathiTrust catalog right now. It'll just take a
while, given that I don't have access to Roy's Hadoop cluster :-)


On Wed, Oct 16, 2013 at 1:38 PM, Sean Hannan  wrote:

> That sounds like a request for Roy to fire up the ole OCLC Hadoop.
>
> -Sean
>
>
>
> On 10/16/13 1:06 PM, "Karen Coyle"  wrote:
>
> >Anybody have data for the average length of specific MARC fields in some
> >reasonably representative database? I mainly need 100, 245, 6xx.
> >
> >Thanks,
> >kc
> >
> >--
> >Karen Coyle
> >kco...@kcoyle.net http://kcoyle.net
> >m: 1-510-435-8234
> >skype: kcoylenet
>



-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] MARC field lengths

2013-10-16 Thread Sean Hannan
That sounds like a request for Roy to fire up the ole OCLC Hadoop.

-Sean



On 10/16/13 1:06 PM, "Karen Coyle"  wrote:

>Anybody have data for the average length of specific MARC fields in some
>reasonably representative database? I mainly need 100, 245, 6xx.
>
>Thanks,
>kc
>
>-- 
>Karen Coyle
>kco...@kcoyle.net http://kcoyle.net
>m: 1-510-435-8234
>skype: kcoylenet


[CODE4LIB] MARC field lengths

2013-10-16 Thread Karen Coyle
Anybody have data for the average length of specific MARC fields in some 
reasonably representative database? I mainly need 100, 245, 6xx.


Thanks,
kc

--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet


Re: [CODE4LIB] pdf2txt

2013-10-16 Thread Kevin Hawkins

On 10/15/13 11:45 AM, Eric Lease Morgan wrote:

On Oct 14, 2013, at 7:56 AM, Nicolas Franck  wrote:


Could this also be done by Apache Tika? Or do I miss a crucial point?

http://tika.apache.org/1.4/gettingstarted.html



Nicolas, this looks VERY promising! It seemingly can extract the OCRed text from 
a PDF document as well as the text from a Word document. 'More experimenting, 
but thank you. code4lib++  --Eric Morgan


In case they are of use to anyone, here are links I've collected over 
the years (some may be dead) to other tools that include the capability 
to extract text from a vector PDF (not a raster one that still needs to 
be OCRd):


* pdfx: http://pdfx.cs.man.ac.uk/

* LA-PDFText: https://code.google.com/p/lapdftext/

* pdf2htmlEX: https://github.com/coolwanglu/pdf2htmlEX

* Apache PDFBox: http://pdfbox.apache.org/

* pdf2txt.py, part of PDFMiner: 
http://www.unixuser.org/~euske/python/pdfminer/


* pdftotext (part of xpdf)

See also the list at http://scholrev.org/hackathon/ and this discussion 
of using Jade, Gemini, and Adobe Acrobat to extract text from a PDF: 
http://www.ncbi.nlm.nih.gov/books/NBK61837/ .
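As one concrete example of driving an extractor from a script, the Tika app
jar mentioned earlier in the thread can be shelled out to (the jar name and
path below are assumptions; Tika's --text flag writes the extracted plain
text to stdout):

    import subprocess

    # Assumption: a local copy of the Tika app jar; name/version are made up.
    result = subprocess.run(
        ['java', '-jar', 'tika-app-1.4.jar', '--text', 'document.pdf'],
        capture_output=True, text=True, check=True)
    print(result.stdout)   # the PDF's existing text layer, as plain text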


--Kevin


[CODE4LIB] ALCTS Metadata Interest Group - Call for Proposals at ALA Midwinter 2014

2013-10-16 Thread Glendon, Ivey (img7u)
The ALCTS Metadata Interest Group invites speakers to present at the ALA 
Midwinter meeting in Philadelphia on Sunday, January 26, 2014 from 8:30 to 
10am.  Presentations will be approximately 30 minutes, including Q&A.

Our charge is to provide a broad framework for information exchange on current 
research developments, tools, and activities affecting networked information 
resources and metadata; to coordinate and actively participate in the development 
and review of standards concerning networked resources and metadata in 
conjunction with the divisions' committees and sections, other units within 
ALA, and relevant outside agencies; and to develop programs and foster and 
sponsor education and training opportunities that contribute to and enhance an 
understanding of networked resources and metadata, their identity, content, 
technology, access, control, and use.

Suggested topics include but are not limited to:

*Tools for metadata librarians, including those for creation and/or 
transformation
*Demos or light tutorials on tools
*Top-level overviews of tools/mechanisms for use in managing metadata
*Innovative workflow design focusing on the intersection of tools/mechanisms 
and staff education and training

Please email proposal abstracts to program co-chairs by Wednesday, October 
30th, 2013.  Please contact us with questions.  Thank you!

Program co-chairs,

Ivey Glendon
Metadata Librarian
University of Virginia Library
(434) 243-0634 | im...@virginia.edu

Santi Thompson
Metadata & Digitization Operations Coordinator
University of Houston Library
(713) 743-9685 | sathomps...@uh.edu


Re: [CODE4LIB] pdf2txt

2013-10-16 Thread Robert Haschart

On 10/15/2013 12:25 PM, Eric Lease Morgan wrote:

On Oct 14, 2013, at 4:49 PM, Robert Haschart  wrote:


For a limited period of time I am making publicly available a Web-based program 
called PDF2TXT -- http://bit.ly/1bJRyh8

Although, based on some subsequent messages where you mention tesseract,
maybe I misunderstood and your tool only handles PDFs that have already
been OCRed, which would explain why the second document (which only
contains page images) fails.

Robert, that's correct. As of right now the document needs to have been 
previously OCRed. --Eric

The abstract extraction routine I have been working on does use 
tesseract internally for doing OCR when it encounters a document that 
doesn't have usable full-text.  I agree that tesseract is not that easy 
to install, especially if (as in my case) you do not have root/sudo 
access to the machine.  Since I have gone through installing tesseract 
quite recently, perhaps my experience can be helpful to you.
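The routine itself isn't shown here, but the general shape (rasterize each
page, then OCR it) might look like the following sketch. pdf2image and
pytesseract are assumptions, and both shell out to other software (poppler's
pdftoppm and the tesseract binary, respectively):

    from pdf2image import convert_from_path  # assumption: wraps pdftoppm
    import pytesseract                       # assumption: wraps tesseract

    pages_text = []
    for page in convert_from_path('scanned.pdf', dpi=300):  # hypothetical file
        pages_text.append(pytesseract.image_to_string(page))
    print('\n'.join(pages_text))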


-Bob Haschart


Re: [CODE4LIB] local APIs atop III's Sierra DB

2013-10-16 Thread Thomale, Jason
Everyone: You guys are fantastic. Thanks to those who have responded thus far 
for being so willing to share. I will be contacting y'all off-list, if you 
don't mind. :-)

Just wanted to tag onto Dave's response here...

> I've written a decent amount of code against Sierra, but I don't know if
> any of it amounts to an "API".
> 
...
> * I've also started creating little web services with mod_perl for use
> in a
> web-application I'm working on.  Examples: a script that spits back item
> information in JSON when given an item barcode, a script that spits back
> a
> JSON list of all attached items when given a bib record number.  Again
> these are mostly special purpose, but I have a notion to find ways to
> generalize them.

Yes this is basically where I am right now and where this is coming from. I've 
thrown together sort of a prototype app for helping us with some inventory 
stuff we're doing, which consists of a really quick-and-dirty web service that 
serves up JSON and a bootstrap/jQuery front-end. For what it is--which at this 
point isn't much more than a proof-of-concept--it works. But. In the coming 
year there are a lot of similar things we plan to do, and building out a 
RESTful API to serve up catalog data in particular ways seems like a logical 
step right now.

Julia alluded to "some things you don't want to do when you're querying the 
database," which is something I'm interested in talking about as well. If my 
experiences are anything like yours, Julia, I'm finding things just aren't 
indexed in ways that are optimal for our use cases. Namely, querying on 
most variable field data is out of the question if you don't want multi-minute 
response times. It seems the only way to get this to work well will be to dump 
portions of the database out to an external document store / indexer. I'm 
primarily looking at serving up JSON at this point, so probably something like 
Solr or Elasticsearch. Learning from your experiences building a Sierra driver 
for VuFind would be quite helpful and interesting.
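
As an illustration of that dump-and-index idea, here is a minimal sketch that 
reads bib data from Sierra's Postgres views with psycopg2 and pushes JSON 
documents into Elasticsearch over its REST API. The view and column names, 
connection details, index name, and document endpoint are all hypothetical -- 
check them against your own Sierra instance and Elasticsearch version.

import json
import urllib.request

import psycopg2  # third-party: pip install psycopg2

# Placeholder connection details for Sierra's read-only Postgres interface.
conn = psycopg2.connect("host=sierra.example.edu port=1032 dbname=iii "
                        "user=reports password=secret")
cur = conn.cursor()
# Hypothetical query: record numbers and best titles for a batch of bibs.
cur.execute("""
    SELECT record_num, best_title
      FROM sierra_view.bib_record_property
     LIMIT 1000
""")

for record_num, title in cur:
    doc = json.dumps({"record_num": record_num, "title": title}).encode("utf-8")
    # PUT each bib as a document into a local Elasticsearch index.
    req = urllib.request.Request(
        "http://localhost:9200/catalog/_doc/%s" % record_num,
        data=doc,
        method="PUT",
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)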

Francis, I'll be interested to see whether you're thinking along similar lines 
or if you're going a totally different direction...

> Sadly, I'm a team of one here and I'm a bit shy about the state my code
> is
> currently in, so I haven't published it anywhere.  ( Also the way I use
> git
> locally is probably "wrong", not to mention there are probably passwords
> in
> old commits. )

No worries! I completely understand, and I share your shyness. Believe me, I'm 
the last person that should judge.

> Nonetheless, I'd definitely be interested in collaborating on anything
> that
> might benefit all Sierra users.

Cool. I really appreciate it. I guess--at this point I'm still looking at 
solving local needs first, but making it easy enough to extend to new use 
cases. Or...at the very least doing something that will provide for a good 
learning experience. :-) I don't know--it's all still just ideas.

Thanks,

Jason


[CODE4LIB] NISO Releases Draft Recommended Practice on Indexed Discovery Service for Comments

2013-10-16 Thread Ken Varnum
Of possible interest to this group. I was one of the members of the NISO
ODI group that put together this draft recommendation. Your comments are
welcome at http://www.niso.org/publications/rp/rp-19-201x

Ken Varnum



For release: 16 Oct 2013

NISO Releases Draft Recommended Practice on Indexed
Discovery Service for Comments

Baltimore, MD - October 16, 2013 - The National Information Standards
Organization (NISO) is seeking comments on the draft recommended practice *Open
Discovery Initiative: Promoting Transparency in Discovery*. Launched in
2012, the NISO Open Discovery Initiative (ODI) aims to facilitate increased
transparency in the content coverage of index-based discovery services and
to recommend consistent methods of content exchange. This draft recommended
practice provides specific guidelines for content providers on metadata
elements, linking, and technical formats, and for discovery service
providers on content listings, linking, file formats, methods of transfer,
and usage statistics. The document also provides background information on
the evolution of discovery and delivery technology and a standard set of
terminology and definitions for this technology area.

"An increasing number of libraries, especially those that serve academic or
research institutions, have invested in index-based discovery services as a
strategic interface to all their resources," states Marshall Breeding, an
independent library consultant and Co-chair of the ODI Working Group.
"These libraries expect their uniquely licensed and purchased electronic
content to be made available within their discovery service of choice. But
it is often not clear which resources are available, which are indexed in
full text, by citations only, or both, and whether the metadata derives
from aggregated databases or directly through the full text. Libraries
deserve a clear explanation of the degree of availability of their content
in the available discovery services and they need usage statistics for
access from the discovery tool."

"The domain of index-based discovery services involves a complex ecosystem
of interrelating issues and interests among content providers, libraries,
and discovery service creators," explains Jenny Walker, an independent
consultant and Co-chair of the ODI Working Group. "The increasing use of
indexed search as a primary means for library patrons to discover and
access licensed content brings with it new requirements for industry
practices that will ensure consistent provision of metadata, unbiased
linking to source material, and neutrality of algorithms for generating
result sets, relevance rankings, and link order. Specific guidelines around
these issues are given in the ODI Recommended Practice."

"In addition to the recommendations in the current draft, the ODI Working
Group has identified a number of actions for future work," states Nettie
Lagace, NISO Associate Director for Programs. "NISO plans to support this
follow-up effort to address such issues as collaborative discussion
mechanisms, application programming interfaces, handling of restricted
content, on-demand lookup, and interaction with COUNTER about usage
statistics related to discovery services."

The draft recommended practice is open for public comment through November
18, 2013. To download the draft or submit online comments, visit the Open
Discovery Initiative webpage at: www.niso.org/workrooms/odi/.

About NISO
NISO fosters the development and maintenance of standards that facilitate
the creation, persistent management, and effective interchange of
information so that it can be trusted for use in research and learning. To
fulfill this mission, NISO engages libraries, publishers, information
aggregators, and other organizations that support learning, research, and
scholarship through the creation, organization, management, and curation of
knowledge. NISO works with intersecting communities of interest and across
the entire lifecycle of an information standard. NISO is a not-for-profit
association accredited by the American National Standards Institute (ANSI).
More information about NISO is available on its website: www.niso.org, or
contact NISO at (301) 654-2512 or via email at nis...@niso.org.


-- 

--
Ken Varnum | Web Systems Manager | MLibrary - University of Michigan - Ann
Arbor
var...@umich.edu | @varnum | http://www.lib.umich.edu/users/varnum |
734-615-3287


[CODE4LIB] Job: PHP developer at Comperio srl

2013-10-16 Thread jobs
**Who we are:**  
  
Comperio srl is a company that focuses on implementing IT solutions for the
world of libraries. Since 2004 it has developed
[ClavisNG](http://www.comperio.it/solutions/clavisng-en-US/an-open-source-ils-for-libraries-networks/),
a web-based Integrated Library System for library networks, which was released
in 2010 under the AGPL open source license. Since 2011 it has also developed a
[module](http://www.comperio.it/solutions/discoveryng-en-US/overview/) for the
Silverstripe CMS that communicates with ClavisNG. In addition, the company
develops RFID solutions for libraries.

  
**Who we are looking for:**  
  
We are looking for a PHP programmer, preferably with experience, to join the
ClavisNG development team full-time. The ideal candidate is comfortable
developing web applications on a LAMP stack (Linux, Apache, MySQL, PHP) and has
an aptitude for learning and teamwork.

  
**Place of work:** Rovigo  
  
Full job description (in Italian):
[http://www.comperio.it/about/lavora-con-noi/](http://www.comperio.it/about/lavora-con-noi/)

  



Brought to you by code4lib jobs: http://jobs.code4lib.org/job/10371/


Re: [CODE4LIB] local APIs atop III's Sierra DB

2013-10-16 Thread Dave Menninger
I've written a decent amount of code against Sierra, but I don't know if
any of it amounts to an "API".

* I've built some utility code that has grown into a handful of Perl modules
that I use regularly in creating new reports.  Most of these are
special-purpose for applications we have in-house, but I'm trying to find
ways to generalize them.  Examples: wrapper functions around the Patron
Update Web Service, functions to look up shelf location names, find/clean
up patron data entry errors, etc.

* I've also started creating little web services with mod_perl for use in a
web-application I'm working on.  Examples: a script that spits back item
information in JSON when given an item barcode, a script that spits back a
JSON list of all attached items when given a bib record number.  Again
these are mostly special purpose, but I have a notion to find ways to
generalize them.
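
The shape of such a service is simple enough to sketch. Dave's versions are 
mod_perl; the following illustration uses Python's standard-library WSGI 
server instead, with the Sierra lookup stubbed out so the sketch stays 
self-contained -- the barcode parameter and the returned fields are 
hypothetical, not anyone's actual service.

import json
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

def lookup_item(barcode):
    # A real service would query Sierra's Postgres views here; this stub
    # returns a canned record so the example runs on its own.
    return {"barcode": barcode, "location": "stacks", "status": "-"}

def app(environ, start_response):
    # Pull the barcode out of the query string and answer with JSON.
    qs = parse_qs(environ.get("QUERY_STRING", ""))
    barcode = qs.get("barcode", [""])[0]
    body = json.dumps(lookup_item(barcode)).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]

if __name__ == "__main__":
    # Try: curl 'http://localhost:8000/?barcode=39015012345678'
    make_server("", 8000, app).serve_forever()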

I'm aware of two GitHub repos that might be of interest in this
conversation:
* https://github.com/mcoia/sierra_marc_tools
* https://github.com/geekmuse/sierra-perl-scripts

Sadly, I'm a team of one here and I'm a bit shy about the state my code is
currently in, so I haven't published it anywhere.  ( Also the way I use git
locally is probably "wrong", not to mention there are probably passwords in
old commits. )

Nonetheless, I'd definitely be interested in collaborating on anything that
might benefit all Sierra users.

Feel free to contact me off-list if you want to chat more.

~Dave



On Tue, Oct 15, 2013 at 5:36 PM, Julia Bauder wrote:

> Jason,
>
> To expand on Becky's answer a bit: we haven't written our own APIs yet, but
> I did write a Sierra driver for VuFind, so I do have some notes that might
> be useful to you that I'm happy to share. At least, I've learned the hard
> way some things that you don't want to do when you're querying the
> database. ;-)
>
> Julia
>
> *
>
> Julia Bauder
>
> Social Studies and Data Services Librarian
>
> Grinnell College Libraries
>
>  Sixth Ave.
>
> Grinnell, IA 50112
>
>
> 641-269-4431
>
>
>
>
>
> On Tue, Oct 15, 2013 at 2:41 PM, Becky Yoose  wrote:
>
> > Hi Jason,
> >
> > We haven't planned to write our own APIs for Sierra at this point (we're
> > still working on getting Sierra to work in the first place), but Grinnell
> > would be interested in seeing how the process goes for you in terms of
> > local API building.
> >
> > As for the Sierra APIs - III just hired a new API project manager (the
> one
> > that attended #c4l13 has since left the company) so I'm not sure what's
> all
> > going on. They are still saying that patron facing APIs will be out by
> > winter, though I'd wish the staff facing APIs would get some love too...
> >
> > Thanks,
> > Becky
> >
> > -
> > Becky Yoose
> > Discovery and Integrated Systems Librarian
> > Grinnell College
> >
> >
> > On Tue, Oct 15, 2013 at 2:29 PM, Thomale, Jason wrote:
> >
> > > Hello Code4lib,
> > >
> > > I'm wondering if any III Sierra users out there have worked on building
> > an
> > > API for accessing their ILS data on top of Sierra's Postgres database.
> > > Right now I'm looking into possibly building something to serve local
> > needs
> > > and use cases, as we're not terribly confident that III's forthcoming
> > > APIs--if they are indeed forthcoming--will really fit the bill.
> > >
> > > If this is something you're doing or have considered doing and wouldn't
> > > mind comparing notes, please drop me a line! Thanks.
> > >
> > > Jason Thomale
> > > Resource Discovery Systems Librarian
> > > University of North Texas
> > >
> >
>


[CODE4LIB] Hackathon in Philly, January 24th

2013-10-16 Thread Chris Strauber
Hey all,

I'd like to invite you to a hackathon in Philadelphia on January 24th. It's
being sponsored by ALA's Library Code Year IG (with help from LITA and
ALCTS) at the Penn Special Collections center. We'll be hacking on several
Worldcat APIs and the DPLA API, and there will be coders from both
organizations present to help introduce them and participate. It's open to
all skill levels: we'll be using the Worldcat APIs as a beginners' track,
and DPLA for the more self-directed. It's not an official ALA event, so no
need to register for the conference to come. More details here: <
http://www.libhack.org>.

Chris Strauber
Co-chair, Code Year IG
Tufts University


[CODE4LIB] Job: Digital Humanist at University of Cologne

2013-10-16 Thread jobs
The Cologne Center for eHumanities (CCeH) at the University of
Cologne is seeking a Digital Humanist for a position as
research associate (50%) at the earliest date possible,
initially for a period of 14 months.

  
Applicants should ideally possess most of the following skills and
competences or, if necessary, acquire them in a timely
manner:

  * Modeling of research agendas and information resources
  * Metadata and data standards (e.g. DC, TEI, METS, EAD, CMDI)
  * XML technologies (XML, XSLT, XQuery, XML databases)
  * Web technologies (HTML, CSS, Javascript)
  * Programming (Python, Java etc.)
  * Design and creation of web applications regarding technology, concept & 
content
Excellent communicative and organizational skills are required, as
well as the ability and willingness to both work in a team and
manage projects independently.

  
A background in the humanities would be an advantage. The
upcoming projects span, among other fields of study,
linguistics, philologies, history, and archaeology. They
involve dictionaries, palaeography, papyrology, cataloging,
editions, etc.

  
Anyone interested in the position may contact the CCeH directly:

  
info-c...@uni-koeln.de

  
or by phone

0049-221-470-3894 or -4056

  
http://www.cceh.uni-koeln.de/



Brought to you by code4lib jobs: http://jobs.code4lib.org/job/10370/