Re: [CODE4LIB] Slicing/dicing/combining large amounts of data efficiently

2013-02-27 Thread Reese, Terry
Kyle -- if this was me -- I'd break the file into a database.  You have a lot
of different options, but the last time I had to do something like this, I
broke the data into 10 tables: a control table with a primary key and OCLC
number, then a table for the 0xx fields, a table for the 1xx, 2xx, etc., each
including the OCLC number and the key they relate to.  You can actually do this
with MarcEdit (if you have MySQL installed) -- but on a laptop, I'm not going
to guarantee the speed of the process.  Plus, generating the SQL data will take
significant time -- maybe 15 hours to build the database -- but then you'd have
it, could create indexes on it, and could use it to prep the files for later
work.
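
For illustration, roughly the layout I mean, sketched with SQLite (the table
and column names are just illustrative, not what MarcEdit generates):

import sqlite3

conn = sqlite3.connect("marc_master.db")
conn.executescript("""
CREATE TABLE control (
    rec_id INTEGER PRIMARY KEY,
    oclc   TEXT UNIQUE           -- OCLC number, the join key
);
CREATE TABLE fields_0xx (
    rec_id INTEGER REFERENCES control(rec_id),
    tag    TEXT,
    value  TEXT
);
-- ...and likewise fields_1xx through fields_9xx
""")
# Indexes go on after the bulk load so the initial inserts stay fast.
conn.execute("CREATE INDEX idx_0xx_rec ON fields_0xx (rec_id)")
conn.commit()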

--TR

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kyle 
Banerjee
Sent: Wednesday, February 27, 2013 9:45 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Slicing/dicing/combining large amounts of data efficiently

I'm involved in a migration project that requires identification of local 
information in millions of MARC records.

The master records I need to compare with are 14GB total. I don't know what the 
others will be, but since the masters are deduped and the source files aren't 
(plus they contain loads of other garbage), there will be considerably more. 
Roughly speaking, if I compare 1000 master records per second, it would take 
about 2 1/2 hours to cut through the file. I need to be able to ask the file 
whatever questions the librarians might have (i.e., many), so speed is
important.

For reasons I won't go into right now, I'm stuck doing this on my laptop in
Cygwin, and that limits my range of motion.

I'm trying to figure out the best way to proceed. Currently, I'm extracting 
specific fields for comparison. Each field tag gets a single line keyed by OCLC 
number (repeated fields are catted together with a delimiter). The idea is that 
if I deal with only one field at a time, I can slurp the master info in memory 
and retrieve it via hash (OCLC control number) as I loop through the comparison 
data. Local data will either be stored in special files that are loaded 
separately from the bibs or recorded in reports for maintenance projects.

This process is clunky because a special comparison file has to be created for 
each question, but it does seem to work (generating preprocess files and then 
doing the compare is measured in minutes rather than hours). I didn't use a DB 
because there's no way I could store the reference data in memory and I figured 
I'd just thrash my drive.
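
In rough Python terms, each question then boils down to something like this (a
sketch; the tab-delimited layout and file names are just illustrative):

# master_245.tsv / source_245.tsv: one line per record, "OCLC<TAB>field value(s)"
master = {}
with open("master_245.tsv", encoding="utf-8") as f:
    for line in f:
        oclc, _, value = line.rstrip("\n").partition("\t")
        master[oclc] = value                   # whole master file held in memory

with open("source_245.tsv", encoding="utf-8") as src, \
     open("report_245.tsv", "w", encoding="utf-8") as report:
    for line in src:
        oclc, _, value = line.rstrip("\n").partition("\t")
        if master.get(oclc, value) != value:   # present in master but different
            report.write(line)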

Is this a reasonable approach, and whether or not it is, what tools should I be 
thinking of using for this? Thanks,

kyle


Re: [CODE4LIB] wiki page about the chode4lib irc bot created

2013-01-24 Thread Reese, Terry
 Looking at that, the only absolutely library-specific content there
 appears to be the MARC plugin (which isn't documented in detail).
MARC and not well documented...that sounds about right.  

--tr

*
Terry Reese, Associate Professor
Gray Family Chair for
Innovative Library Services
121 Valley Library
Corvallis, OR  97331
tel: 541.737.6384
*




Re: [CODE4LIB] code4lib.org domain

2012-12-18 Thread Reese, Terry
Wilhelmina,

To answer your two questions.
1) Yes -- during the 30-day expiration period after registration lapses, your
site will typically become unavailable.
2) This isn't just about one person at OSU.  Ryan Ordway is our sys admin, but
c4l is supported by a number of folks at the institution in various
capacities...up to the director.  Were Ryan to leave, the process for
maintaining the infrastructure would simply fall to someone else at the Library.

Tr



*
Terry Reese, Associate Professor
Gray Family Chair for
Innovative Library Services
121 Valley Library
Corvallis, OR 97331
541.737.6384


From: Wilhelmina Randtke
Sent: 12/18/2012 2:00 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] code4lib.org domain

Paying for it shouldn't be an issue.  It's like $10 a year to register the
domain, right?  So, don't make a big deal out of OSU paying for it.  The
fee is negligible.

The key concern is how committed Ryan Ordway is to OSU, and what the
climate there is like.  I see this as transferring to the people who are
currently technical contacts at OSU, not to a faceless organization.  If
they already hold several other domains, and have a policy and timeframe for
tracking and renewing them, then that's a plus.

Also, I asked before, and I'm going to ask again: will the domain stop
working (i.e., stop pointing at nameservers) during the redemption period?  If
so, then a worst-case scenario is not too bad, because there will be some
warning and a late fee, assuming the registered owner can be contacted,
rather than just losing the domain if the bill isn't paid.

-Wilhelmina Randtke


On Tue, Dec 18, 2012 at 3:41 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 I definitely see what you're saying, but think there are pros and cons
 both ways.

 OSU is already responsible for the bulk of our infrastructure too, adding
 the DNS would be minor.

 But there are definitely pros (as well as cons) to individual and/or
 non-institutional ownership/responsibility/management, compared to
 institutional.

 In the end, as with much of Code4Lib, as with many volunteer projects -- what
 it comes down to is who's offering to volunteer to do it. OSU is offering
 to volunteer to do it (and pay for it, apparently?), and we obviously find
 OSU to be generally responsible, since they host the rest of our
 infrastructure.

 Someone offering to do it right now, someone we find generally responsible
 -- always beats the hypothetical other solution that has nobody actually
 volunteering to do it.

 So, Wilhelmina, are you volunteering to run the DNS instead? :) (and pay
 for it, or fundraise to pay for it)  If you are, then we might have two
 options. Otherwise, we've got one, and no reason to reject it unless we
 thought OSU was not trustworthy with the responsibility or something (which,
 if we did, would be a big problem, since they're already responsible for a lot
 more than that).


 On 12/18/2012 4:34 PM, Wilhelmina Randtke wrote:

 I'm for individual ownership and management over organizational.
 Organizations tend not to have written documentation, and to rely on
 institutional memory.  I see two things going wrong: the contact at OSU
 leaves OSU and no one thinks to renew the domain, or OSU doesn't have a
 dedicated contact and at some point they don't renew because they don't
 see the value.

 Also important:  OSU is on state funding cycles, so may have some rule
 against renewing for more than a year at a time.  So, the deadline to
 renew
 will come more frequently than it would with unrestricted funds and the
 ability to renew for 5 or 10 years at a time.

 When the domain expires, it will go into a redemption period of about a
 month.  I remember what the whois record looks like for domains in the
 redemption period, and whois does give the contact information.  Does the
 URL stop working during this period?  If so, then that's great because if
 there is a problem with a renewal then many people will notice the URL not
 working, and be able to check the status of the domain and get on it.

 -Wilhelmina Randtke


 On Tue, Dec 18, 2012 at 2:32 PM, Ed Summers e...@pobox.com wrote:

 Hi all,

 I've owned the code4lib.org domain since 2005 and have been thinking it
 might be wise to transfer ownership of it to someone else. Sometimes I
 forget to pay bills, and miss emails, and it seems like the domain
 means something to a larger group of people.

 With Ryan Ordway's help, Oregon State University indicated they would
 be willing to take over administration of the domain. They also have
 been responsible for running the Drupal instance at code4lib.org and
 the Mediawiki instance at wiki.code4lib.org -- so it seems like a
 logical move.

 But I thought I would bring it up here first in the interests of
 transparency, community building and whatnot, to see if there were any
 objections or ideas.

 //Ed






Re: [CODE4LIB] Leader in MarcXML Files ( Record Length )

2012-06-29 Thread Reese, Terry
I wouldn't.  One of the benefits of MARCXML is that you are not constrained by
MARC's record length issues.  Deciding to calculate that value would add an
arbitrary length limitation to the format (in my opinion).

Tr

*
Terry Reese, Associate Professor
Gray Family Chair for
Innovative Library Services
121 Valley Library
Corvallis, OR 97331
541.737.6384


From: Sullivan, Mark V
Sent: 6/29/2012 6:52 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Leader in MarcXML Files ( Record Length )

All,



I received a question regarding a software library I have created and released 
as open source.  The record length in the leader ( positions 0-4 ) was not 
being calculated correctly when writing as MarcXML.  However, this raises a 
more philosophical and larger question.  What is the point of the first five 
digits of the leader, outside of an ISO2709 / MARC21 encoded record?   Should I
calculate the record length AS IF it would be encoded in ISO2709? This would be 
computationally non-trivial and would likely double the time necessary for my 
software to write a MarcXML file. Should I just make the first five digits of 
the leader '0', since it means nothing in the context of a MarcXML file?



Has anyone else pondered this question or have any input on how current systems 
work?



Keep in mind I could be writing a MarcXML record for a record created or 
modified in memory, so just using a pre-existing record length is not an option.



Many thanks for your consideration.


Mark V Sullivan
Digital Development and Web Coordinator
Technology and Support Services
University of Florida Libraries
352-273-2907 (office)
352-682-9692 (mobile)
mars...@uflib.ufl.edu


Re: [CODE4LIB] Leader in MarcXML Files ( Record Length )

2012-06-29 Thread Reese, Terry
If I'm writing MARCXML from scratch, I agree.  If I'm converting it from MARC,
I print out the length value from the record, more for historical purposes.

Tr

*
Terry Reese, Associate Professor
Gray Family Chair for
Innovative Library Services
121 Valley Library
Corvallis, OR 97331
541.737.6384


From: Devon
Sent: 6/29/2012 7:09 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Leader in MarcXML Files ( Record Length )

When writing MARC XML, you should use zeros. The following document
[1] says you can use blanks, but the schema [2] uses a pattern that
indicates digits should be used. When reading MARC XML, you should
just ignore whatever is in those positions.

[1] http://www.loc.gov/standards/marcxml/marcxml-design.html
[2] http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd
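
With pymarc, for example (API as of the versions current in this thread, where
the leader is a plain writable string), that might look like:

from pymarc import MARCReader, record_to_xml

with open("records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        # Zero leader positions 0-4; the value is meaningless in MARCXML.
        record.leader = "00000" + record.leader[5:]
        xml = record_to_xml(record)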

/dev

On Fri, Jun 29, 2012 at 9:51 AM, Sullivan, Mark V mars...@uflib.ufl.edu wrote:
 All,



 I received a question regarding a software library I have created and 
 released as open source.  The record length in the leader ( positions 0-4 ) 
 was not being calculated correctly when writing as MarcXML.  However, this 
 raises a more philosophical and larger question.  What is the point of the 
 first five digits of the leader, outside of an ISO2709 / MARC21 encoded 
 record?   Should I calculate the record length AS IF it would be encoded in 
 ISO2709? This would be computationally non-trivial and would likely double 
 the time necessary for my software to write a MarcXML file. Should I just 
 make the first five digits of the leader '0', since it means nothing in 
 the context of a MarcXML file?



 Has anyone else pondered this question or have any input on how current 
 systems work?



 Keep in mind I could be writing a MarcXML record for a record created or 
 modified in memory, so just using a pre-existing record length is not an 
 option.



 Many thanks for your consideration.


 Mark V Sullivan
 Digital Development and Web Coordinator
 Technology and Support Services
 University of Florida Libraries
 352-273-2907 (office)
 352-682-9692 (mobile)
 mars...@uflib.ufl.edu



--
Sent from my GMail account.


Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Reese, Terry
I would really consider SAX.  In MarcEdit, I had originally utilized an XSLT
process for handling MARCXML translations (using both SAXON and MSXML parsers)
-- but as you noticed, there ends up being an upper limit to what you can
process.  The break point for me came when working with some researchers
experimenting with data from the HathiTrust: they had a 32 GB XML file of
MARCXML that needed to be processed.  Using the DOM model, the process was
untenable.  Re-working the code so that it was SAX based required building,
to some degree, the same type of templating to react to specific elements and
nested elements -- but it shifted processing time so that it took ~8 minutes
to translate those 32 GB of MARCXML data into MARC (and allowed me to include
code that handled some common issues related to field length, etc., at the
point of translation).

Not knowing what your XML files look like, my guess is that if you do it right, 
you can template your SAX code in such a way that the actual processing code is 
smaller and much more efficient than anything you could create using a DOM 
method.
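
The skeleton of such a SAX template is small -- a rough sketch in Python
(element names assume unprefixed MARC21slim; subfield codes are omitted to keep
it short, and the sink callable stands in for whatever per-record handling you
need):

import xml.sax

class MarcXmlHandler(xml.sax.ContentHandler):
    """Keeps only the current record in memory; hands it off at </record>."""
    def __init__(self, sink):
        super().__init__()
        self.sink = sink
        self.text, self.fields, self.tag = [], [], None

    def startElement(self, name, attrs):
        self.text = []
        if name == "record":
            self.fields = []
        elif name in ("datafield", "controlfield"):
            self.tag = attrs.get("tag")

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        if name in ("subfield", "controlfield"):
            self.fields.append((self.tag, "".join(self.text)))
        elif name == "record":
            self.sink(self.fields)   # process, then forget -- flat memory use

xml.sax.parse("marcxml_dump.xml", MarcXmlHandler(sink=lambda rec: None))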

--tr

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kyle 
Banerjee
Sent: Friday, June 08, 2012 11:36 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Best way to process large XML files

I'm working on a script that needs to be able to crosswalk at least a couple 
hundred XML files regularly, some of which are quite large.

I've thought of a number of ways to go about this, but I wanted to bounce this 
off the list since I'm sure people here deal with this problem all the time. My 
goal is to make something that's easy to read/maintain without pegging the CPU 
and consuming too much memory.

The performance and load I'm seeing from running the large files through
LibXML and SimpleXML are completely unacceptable. SAX is not out of the
question, but I'm trying to avoid it if possible to keep the code more compact
and easier to read.


I'm tempted to stream-edit out all line breaks, since they occur in
unpredictable places, and write new ones at the end of each record into a temp
file. Then I can read the temp file one line at a time and process using
SimpleXML. That way, there's no need to load giant files into memory, create
huge arrays, etc., and the code would be easy enough for a 6th grader to
follow. My proposed method doesn't sound very efficient to me, but it should
consume predictable resources which don't increase with file size.
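
Sketched in Python rather than SimpleXML, and assuming records close with a
literal </record> and aren't namespace-prefixed, the idea would be roughly:

import xml.etree.ElementTree as ET

def records(path, endtag="</record>", chunk=1 << 20):
    """Yield one <record>...</record> string at a time, wherever the
    file's original line breaks happen to fall."""
    buf = ""
    with open(path, encoding="utf-8") as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            buf += data
            while True:
                end = buf.find(endtag)
                if end < 0:
                    break
                rec = buf[: end + len(endtag)]
                yield rec[rec.find("<record"):]   # drop root element/preamble
                buf = buf[end + len(endtag):]

for rec in records("big.xml"):   # memory use stays flat per record
    element = ET.fromstring(rec)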

How do you guys deal with large XML files? Thanks,

kyle

<rant>Why the heck does the XML spec require a root element, particularly since
large files usually consist of a large number of records/documents? This makes
it absolutely impossible to process a file of any size without resorting to SAX
or string parsing -- which takes away many of the advantages you'd normally
have with an XML structure.</rant>

--
--
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
baner...@uoregon.edu / 503.999.9787


Re: [CODE4LIB] more on MARC char encoding

2012-04-20 Thread Reese, Terry
Dealing with smart quotes is easy -- dealing with chemistry and mathematics
symbols is much more challenging because there is so much variety.  If you send
me some example documents off list so I can put together some sample files, I
could take a closer look, but I can't make any promises outside of the general
smart quote issue.

--TR

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Deng, 
Sai
Sent: Friday, April 20, 2012 6:55 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] more on MARC char encoding

If a canned cleaner can be added in MarcEdit to deal with smart
quotes/values, that would be great! Besides the smart quotes, please consider
other special characters, including chemistry and mathematics symbols (these
are different types of special characters, right?) To better understand the
character encoding issue, can anybody point me to some resources or a list of
things like UTF8-encoded data that is not in the MARC8 character set? Thanks a lot.
Sophie

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of 
Jonathan Rochkind
Sent: Thursday, April 19, 2012 2:14 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] more on MARC char encoding

Ah, thanks Terry.

That canned cleaner in MarcEdit sounds potentially useful -- I'm in a 
continuing battle to keep the character encoding in our local marc corpus clean.

(The real blame here is on cataloger interfaces that let catalogers save data
containing illegal bytes for the character set it's being saved as.
And/or display the data back to the cataloger using a translation that lets
them show up as expected even though they are _wrong_ for the character set
being saved as.  Connexion is theoretically the Rolls-Royce of cataloger
interfaces; does it do this? Gosh I hope not.)

On 4/19/2012 2:20 PM, Reese, Terry wrote:
 Actually -- the issue isn't one of MARC8 versus UTF8 (since this data is
 being harvested from DSpace and is UTF8 encoded).  It's actually an issue
 with user-entered data -- specifically, smart quotes and the like.  These
 values obviously are not in the MARC8 character set and cause problems for
 many who transform user-entered data from XML to MARC (smart quotes tend to
 be inserted by default on Windows).  If you are sticking with a strictly
 UTF8-based system, there generally are not issues because these are valid
 characters.  If you move them into a system where the data needs to be
 represented in MARC -- then you have more problems.

 We do a lot of harvesting, and because of that, we run into these types of
 issues moving data that is in UTF8, but has characters not represented in
 MARC8, into Connexion and having some of that data flattened.  Given the
 wide range of data not in the MARC8 set that can show up in UTF8, it's not a
 surprise that this would happen.  My guess is that you could add a template
 to your XSLT translation that attempts to filter the most common forms of
 these smart quotes/values and replace them with the more standard values.
 Likewise, if there was a great enough need, I could provide a canned cleaner
 in MarcEdit that could fix many of the most common varieties of these smart
 quotes/values.

 --TR

 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf 
 Of Jonathan Rochkind
 Sent: Thursday, April 19, 2012 11:13 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] more on MARC char encoding

 If your records are really in MARC8 not UTF8, your best bet is to use a tool 
 to convert them to UTF8 before hitting your XSLT.

 The open source 'yaz' command line tools can do it for Marc21.

 The Marc4J package can do it in Java, and probably works for any MARC variant,
 not just Marc21.

 Char encoding issues are tricky. You might want to first figure out if your 
 records are really in Marc8, thus the problems, or if instead they illegally 
 contain bad data or data in some other encoding (Latin1).

 Char encoding is a tricky topic, you might want to do some reading on it in 
 general. The Unicode docs are pretty decent.

 On 4/19/2012 11:06 AM, Deng, Sai wrote:
 Hi list,
 I am a Metadata librarian but not a programmer, sorry if my question seems
 naïve. We use an XSLT stylesheet to transform some harvested DC records from
 DSpace to MARC in MarcEdit, and then export them to OCLC.
 Some characters do not display correctly and need manual editing, for
 example:
 In MarcEditor                      Transferred to OCLC                    Edit in OCLC
 Bayes’ theorem                     Bayes⁰́₉ theorem                        Bayes' theorem
 ―it won‘t happen here‖ attitude    ⁰́₅it won⁰́₈t happen here⁰́₆ attitude      it won't happen here attitude
 “Generation Y”                     ⁰́₋Generation Y

Re: [CODE4LIB] more on MARC char encoding

2012-04-19 Thread Reese, Terry
Actually -- the issue isn't one of MARC8 versus UTF8 (since this data is being
harvested from DSpace and is UTF8 encoded).  It's actually an issue with
user-entered data -- specifically, smart quotes and the like.  These values
obviously are not in the MARC8 character set and cause problems for many who
transform user-entered data from XML to MARC (smart quotes tend to be inserted
by default on Windows).  If you are sticking with a strictly UTF8-based system,
there generally are not issues because these are valid characters.  If you move
them into a system where the data needs to be represented in MARC -- then you
have more problems.

We do a lot of harvesting, and because of that, we run into these types of
issues moving data that is in UTF8, but has characters not represented in
MARC8, into Connexion and having some of that data flattened.  Given the
wide range of data not in the MARC8 set that can show up in UTF8, it's not a
surprise that this would happen.  My guess is that you could add a template to
your XSLT translation that attempts to filter the most common forms of these
smart quotes/values and replace them with the more standard values.
Likewise, if there was a great enough need, I could provide a canned cleaner in
MarcEdit that could fix many of the most common varieties of these smart
quotes/values.
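
The substitution itself is simple once the culprits are listed; in Python
rather than XSLT, something like this (the mapping is illustrative, not
exhaustive):

# Common Windows "smart" punctuation mapped to MARC8-safe ASCII.
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'",  "\u2019": "'",   # curly single quotes
    "\u201c": '"',  "\u201d": '"',   # curly double quotes
    "\u2013": "-",  "\u2014": "--",  # en and em dashes
    "\u2026": "...",                 # ellipsis
})

def flatten_smart_punctuation(value: str) -> str:
    return value.translate(SMART_TO_PLAIN)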

--TR

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of 
Jonathan Rochkind
Sent: Thursday, April 19, 2012 11:13 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] more on MARC char encoding

If your records are really in MARC8 not UTF8, your best bet is to use a tool to 
convert them to UTF8 before hitting your XSLT.

The open source 'yaz' command line tools can do it for Marc21.

The Marc4J package can do it in Java, and probably works for any MARC variant,
not just Marc21.

Char encoding issues are tricky. You might want to first figure out if your 
records are really in Marc8, thus the problems, or if instead they illegally 
contain bad data or data in some other encoding (Latin1).

Char encoding is a tricky topic, you might want to do some reading on it in 
general. The Unicode docs are pretty decent.

On 4/19/2012 11:06 AM, Deng, Sai wrote:
 Hi list,
 I am a Metadata librarian but not a programmer, sorry if my question seems
 naïve. We use an XSLT stylesheet to transform some harvested DC records from
 DSpace to MARC in MarcEdit, and then export them to OCLC.
 Some characters do not display correctly and need manual editing, for example:
 In MarcEditor                      Transferred to OCLC                    Edit in OCLC
 Bayes’ theorem                     Bayes⁰́₉ theorem                        Bayes' theorem
 ―it won‘t happen here‖ attitude    ⁰́₅it won⁰́₈t happen here⁰́₆ attitude      it won't happen here attitude
 “Generation Y”                     ⁰́₋Generation Y⁰́₊                       Generation Y
 listeners‟ evaluations             listeners⁰́Ÿ evaluations                listeners' evaluations
 high school – from                 high school ⁰́₃ from                    high school – from
 Co₀․₅Zn₀․₅Fe₂O₄                    Co²́⁰⁰́Þ²́⁵Zn²́⁰⁰́Þ²́⁵Fe²́²O²́⁴                Co0.5Zn0.5Fe2O4?
 μÎơ                                                                       μ
 Nafion®                            Nafion℗ʼ                               Nafion®
 Lévy                               L©♭vy                                  Lévy
 43±13.20 years                     43℗ł13.20 years                        43±13.20 years
 12.6 ± 7.05 ft∙lbs                 12.6 ℗ł 7.05 ft⁸́₉lbs                   12.6 ± 7.05 ft•lbs
 ‘Pouring on the Pounds'            ⁰́₈Pouring on the Pounds'               'Pouring on the Pounds'
 k-ε turbulence                     k-Îæ turbulence                        k-ε turbulence
 student—neither parents            student⁰́₄neither parents               student-neither parents
 Λ = M – {p1, p2,…,pκ}              Î₎ = M ⁰́₃ {p1, p2,⁰́Œ,pÎð}              ? (won’t save)
 M = (0, δ)x × Y                    M = (0, Îþ)x ©₇ Y                      ?
 100°

[CODE4LIB] Code4Lib West Registration Form: July 30, 2012

2012-04-17 Thread Reese, Terry
The University of Oregon Libraries and Oregon State University Libraries invite 
you to code4lib west, Monday, July 30, 2012, at the UO Knight Library. There is 
no registration fee for this conference. Registration is limited to 50 
participants. All participants are expected to deliver a lightning talk. In the 
event registration fills up quickly, limits on participation per institution 
may be employed. Your registration is not confirmed until you receive an email. 
Registrations will be confirmed by April 30, 2012.

URL: 
https://docs.google.com/spreadsheet/viewform?formkey=dGRFM0Zob1dsNEE2RU9VY25SNlllUEE6MQ

--TR

***
Terry Reese, Associate Professor
Gray Family Chair for
Innovative Library Services
121 Valley Library
Corvallis, OR 97331
tel: 541.737.6384
***


[CODE4LIB] Save the date for Code4Lib West; July 30, 2012

2012-04-04 Thread Reese, Terry
The University of Oregon Libraries and Oregon State University Libraries invite 
you to code4lib west, Monday, July 30, 2012, at the UO Knight Library.

There is no registration fee for this conference.

Registration is limited to 50 participants. All participants are expected to 
deliver a lightning talk. In the event registration fills up quickly, limits on 
participation per institution may be employed.

The conference will be a combination of lightning talks, code/system 
troubleshooting, and birds-of-a-feather groups.

See http://oregondigital.org/digcol/code4libwest/  for more information.

-Karen and Terry

***
Karen Estlund
Digital Library Services, Head
Oregon Digital Newspaper Program, Director
University of Oregon Libraries
Eugene, OR 97403-1299
541-346-185
kestl...@uoregon.edu

Terry Reese, Associate Professor
Gray Family Chair for
Innovative Library Services
121 Valley Library
Corvallis, OR 97331
tel: 541.737.6384
***






Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records

2012-03-08 Thread Reese, Terry
This is one of the reasons you really can't trust the information found in
position 9, and it's why, when I wrote MarcEdit, I utilize a mixed process when
working with data and determining character set -- a process that reads this
byte and takes the information under advisement, but in the end treats it more
as a suggestion and one part of a larger heuristic analysis of the record data
to determine whether the information is in UTF8 or not.  Fortunately,
determining if a set of data is in UTF8 or something else is a fairly easy
process.  Determining the something else is much more difficult, but generally
not necessary.

For that reason, if I were advising other people working on MARC processing
libraries, I'd advocate having a process for recognizing that certain
informational data may not be set correctly, and essentially utilizing a
compatibility process to read and correct them.  Because unfortunately, while
the number of vendors and systems that set this encoding byte correctly has
increased dramatically (it used to be pretty much no one), it's still so
uneven that I generally consider this information unreliable.

--TR

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Godmar 
Back
Sent: Thursday, March 08, 2012 11:01 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded 
III records

On Thu, Mar 8, 2012 at 1:46 PM, Terray, James james.ter...@yale.edu wrote:

 Hi Godmar,

 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 9:
 ordinal not in range(128)

 Having seen my fair share of these kinds of encoding errors in Python, 
 I can speculate (without seeing the pymarc source code, so please 
 don't hold me to this) that it's the Python code that's not set up to 
 handle the UTF-8 strings from your data source. In fact, the error 
 indicates it's using the default 'ascii' codec rather than 'utf-8'. If 
 it said 'utf-8' codec can't decode..., then I'd suspect a problem with the 
 data.

 If you were to send the full traceback (all the gobbledy-gook that 
 Python spews when it encounters an error) and the version of pymarc 
 you're using to the program's author(s), they may be able to help you out 
 further.


My question is less about the Python error, which I understand, than about the
MARC record causing the error and about how others deal with this issue (if
it's a common issue, which I do not know).

But, here's the long story from pymarc's perspective.

The record has leader[9] == 'a', but really, truly contains ANSEL-encoded data. 
 When reading the record with a MARCReader(to_unicode = False) instance, the 
record reads ok since no decoding is attempted, but attempts at writing the 
record fail with the above error since pymarc attempts to
utf8 encode the ANSEL-encoded string which contains non-ascii chars such as
0xe8 (the ANSEL Umlaut prefix). It does so because leader[9] == 'a' (see [1]).

When reading the record with a MARCReader(to_unicode=True) instance, it'll 
throw an exception during marc_decode when trying to utf8-decode the 
ANSEL-encoded string. Rightly so.

I don't blame pymarc for this behavior; to me, the record looks wrong.

 - Godmar

(ps: that said, what pymarc does fails in different circumstances - from what I 
can see, pymarc shouldn't assume that it's ok to utf8-encode the field data if 
leader[9] is 'a'.  For instance, this would double-encode correctly encoded 
Marc/Unicode records that were read with a
MARCReader(to_unicode=False) instance. But that's a separate issue that is not 
my immediate concern. pymarc should probably remember if a record needs or does 
not need encoding when writing it, rather than consulting the leader[9] field.)


[1]
https://github.com/mbklein/pymarc/commit/ff312861096ecaa527d210836dbef904c24baee6


Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records

2012-03-08 Thread Reese, Terry
Ed, 

Sure -- but this is one part of a much larger process.  MarcEdit has two MARC
algorithms: one that is a strict processing algorithm, and one that is a loose
processing algorithm able to process data that would otherwise be invalid for
most processors (and this is done because in the real world, vendors send bad
records...often).  Anyway, the character encoding is actually one of the last
things MarcEdit does before writing the processed file to disk.
The reason for this is that MarcEdit reads and interacts with MARC data at the
bit level, meaning character set is pretty meaningless for the vast majority of
the work that it does.  When writing to disk though, .NET requires the
filestream to be set to the correct encoding, otherwise data can be flattened
and diacritics lost.

Essentially at that last step, the record is passed to a function called 
RecognizeUTF8 that takes a byte array.  The program then enumerates the bytes 
to determine if the record is recognizable as UTF8 using a process based 
loosely around some of the work done by the International Components for 
Unicode (http://site.icu-project.org/) -- who have some incredible C libraries 
that do much more than you'd ever need to know how to do.  While these don't
work in C#, they demonstrate some well-known methods for evaluating byte-level
data for code page detection.

Of course, one area where I split directions is that I'm not interested in
other character sets, and MARC data with poorly coded UTF8 needs to be forced
to render as MARC8 (my opinion) until the invalid characters are corrected.
So, in my process, invalid UTF8 data will flag the process and force data
output in the mnemonic data format I use for MARC8-encoded data.
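
As a rough sketch of that recognition step (Python rather than C#, and not
MarcEdit's actual code -- it also skips a few overlong/surrogate corner cases
that a full validator would reject):

def recognize_utf8(data: bytes) -> bool:
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                       # plain ASCII byte
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:
            need = 1                       # 2-byte sequence
        elif 0xE0 <= b <= 0xEF:
            need = 2                       # 3-byte sequence
        elif 0xF0 <= b <= 0xF4:
            need = 3                       # 4-byte sequence
        else:
            return False                   # stray continuation or bad lead
        if i + need >= n or any(not 0x80 <= data[i + k] <= 0xBF
                                for k in range(1, need + 1)):
            return False                   # truncated or bad continuation
        i += need + 1
    return True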

Does that make sense?

--TR

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ed 
Summers
Sent: Thursday, March 08, 2012 12:19 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded 
III records

Hi Terry,

On Thu, Mar 8, 2012 at 2:36 PM, Reese, Terry terry.re...@oregonstate.edu 
wrote:
 This is one of the reasons you really can't trust the information found in
 position 9, and it's why, when I wrote MarcEdit, I utilize a mixed process
 when working with data and determining character set -- a process that reads
 this byte and takes the information under advisement, but in the end treats
 it more as a suggestion and one part of a larger heuristic analysis of the
 record data to determine whether the information is in UTF8 or not.
 Fortunately, determining if a set of data is in UTF8 or something else is a
 fairly easy process.  Determining the something else is much more difficult,
 but generally not necessary.

Can you describe in a bit more detail how MarcEdit sniffs the record to
determine the encoding? This has come up enough times w/ pymarc to make it
worth implementing.

//Ed


Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records

2012-03-08 Thread Reese, Terry
 I also used to think it would be cool if we could get MARC8 
 encoding/decoding into the Python standard library, but then I realized I'd 
 rather work on other stuff while MARC8 withers and dies.

Wouldn't that be nice.  In MarcEdit, all data wants to be treated as UTF8;
MARC8 support is there as a legacy.  Which is why processing MARC8 data in
MarcEdit is slightly slower than UTF8 (there is a kind of emulation that
occurs to translate character sets on the fly when needed).

--TR

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Gabriel 
Farrell
Sent: Thursday, March 08, 2012 12:19 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded 
III records

Sounds like what you do, Terry, and what we need in PyMARC, is something like 
UnicodeDammit [0]. Actually handling all of these esoteric encodings would be 
quite the chore, though.

I also used to think it would be cool if we could get MARC8 encoding/decoding 
into the Python standard library, but then I realized I'd rather work on other 
stuff while MARC8 withers and dies.


[0] https://github.com/bdoms/beautifulsoup/blob/master/BeautifulSoup.py#L1753

On Thu, Mar 8, 2012 at 2:36 PM, Reese, Terry terry.re...@oregonstate.edu 
wrote:
 This is one of the reasons you really can't trust the information found in
 position 9, and it's why, when I wrote MarcEdit, I utilize a mixed process
 when working with data and determining character set -- a process that reads
 this byte and takes the information under advisement, but in the end treats
 it more as a suggestion and one part of a larger heuristic analysis of the
 record data to determine whether the information is in UTF8 or not.
 Fortunately, determining if a set of data is in UTF8 or something else is a
 fairly easy process.  Determining the something else is much more difficult,
 but generally not necessary.

 For that reason, if I were advising other people working on MARC processing
 libraries, I'd advocate having a process for recognizing that certain
 informational data may not be set correctly, and essentially utilizing a
 compatibility process to read and correct them.  Because unfortunately, while
 the number of vendors and systems that set this encoding byte correctly has
 increased dramatically (it used to be pretty much no one), it's still so
 uneven that I generally consider this information unreliable.

 --TR

 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf 
 Of Godmar Back
 Sent: Thursday, March 08, 2012 11:01 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and 
 misencoded III records

 On Thu, Mar 8, 2012 at 1:46 PM, Terray, James james.ter...@yale.edu wrote:

 Hi Godmar,

 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 9:
 ordinal not in range(128)

 Having seen my fair share of these kinds of encoding errors in 
 Python, I can speculate (without seeing the pymarc source code, so 
 please don't hold me to this) that it's the Python code that's not 
 set up to handle the UTF-8 strings from your data source. In fact, 
 the error indicates it's using the default 'ascii' codec rather than 
 'utf-8'. If it said 'utf-8' codec can't decode..., then I'd suspect a 
 problem with the data.

 If you were to send the full traceback (all the gobbledy-gook that 
 Python spews when it encounters an error) and the version of pymarc 
 you're using to the program's author(s), they may be able to help you out 
 further.


 My question is less about the Python error, which I understand, than 
 about the MARC record causing the error and about how others deal with 
 this issue (if it's a common issue, which I do not know.)

 But, here's the long story from pymarc's perspective.

 The record has leader[9] == 'a', but really, truly contains 
 ANSEL-encoded data.  When reading the record with a 
 MARCReader(to_unicode = False) instance, the record reads ok since no 
 decoding is attempted, but attempts at writing the record fail with 
 the above error since pymarc attempts to
 utf8 encode the ANSEL-encoded string which contains non-ascii chars 
 such as
 0xe8 (the ANSEL Umlaut prefix). It does so because leader[9] == 'a' (see [1]).

 When reading the record with a MARCReader(to_unicode=True) instance, it'll 
 throw an exception during marc_decode when trying to utf8-decode the 
 ANSEL-encoded string. Rightly so.

 I don't blame pymarc for this behavior; to me, the record looks wrong.

  - Godmar

 (ps: that said, what pymarc does fails in different circumstances - 
 from what I can see, pymarc shouldn't assume that it's ok to 
 utf8-encode the field data if leader[9] is 'a'.  For instance, this 
 would double-encode correctly encoded Marc/Unicode records that were 
 read with a
 MARCReader(to_unicode=False) instance. But that's a separate issue

Re: [CODE4LIB] MarcEdit command line tool

2011-09-01 Thread Reese, Terry
Here's the problem -- you are missing a switch.  In MarcEdit, the XSLT
conversions run through MARC21XML.  To move from MARC21XML to MARC, MarcEdit
uses a crosswalk to the mnemonic format.  When you use the GUI, this value is
set for you -- but since it is user configurable, the command line requires you
to set it.  So, for example, here's how it would look running on my machine
(below).

--TR

Here's a full commandline example:

c:\Program Files\MarcEdit 5.0>cmarcedit -s c:\users\reeset\desktop\2011.xml -d
c:\users\reeset\desktop\2011b.mrc -xmlmarc -mxslt c:\program files\MarcEdit
5.0\xslt\MARC21slim2Mnemonic.xsl

Beginning Process...
2 records have been processed in 2.686151 seconds.

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Crowe, 
Sean (crowesn)
Sent: Thursday, September 01, 2011 1:38 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] MarcEdit command line tool

I'm scripting some batch editing routines and I'd like to use the MarcEdit 
command line tool to convert Marc21XML to MARC. I can successfully convert 
using the 5.2 gui but no dice using the command line tool. To this point I've 
only ever used the command line tool to break and make MARC files.

Here is my syntax:
C:\Program Files\MarcEdit 5.0>cmarcedit.exe -s xmltest.xml -d xmltest.mrc -xmlmarc
Beginning Process...
-1 records have been processed in 0.00 seconds.

Header and namespace info from xml doc:
<?xml version='1.0' encoding='UTF-8' ?>
<collection xsi:schemaLocation='http://www.loc.gov/MARC21/slim
http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd'
xmlns='http://www.loc.gov/MARC21/slim'
xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'>
<record>

Should I be referencing an xslt file?

Thanks in advance,


Sean Crowe
Electronic Resources Librarian

Electronic Resources Dept.
University of Cincinnati Libraries
PO Box 210033
Cincinnati OH 45221-0033
Tel: (513) 556-1899
Fax: (513) 556-4393
Email: sean.cr...@uc.edu
gchat: crowesn


[CODE4LIB] Job announcement: Associate University Librarian for Research and Scholarly Communication, Oregon State University

2011-08-30 Thread Reese, Terry
Please share this announcement with colleagues who would be interested.
JOB ANNOUNCEMENT:
Associate University Librarian for Research and Scholarly Communication
Oregon State University Libraries

Oregon State University Libraries seeks an innovative, dynamic, and experienced 
library leader to join the organization's senior leadership team.  The 
Associate University Librarian for Research and Scholarly Communication will 
shape the Libraries' digital library strategies as they advance the development 
and communication of scholarly research and further the University's goal of 
becoming a top ten land-grant institution. As a member of the senior management 
team, the AUL for Research and Scholarly Communication will contribute to 
long-range planning, program development and evaluation, resource development, 
budget formulation, and allocation of resources in support of the Libraries' 
mission.  He/she will work with department heads to identify and implement the 
strategic directions for the Center for Digital Scholarship and Services, 
Emerging Technologies and Services, University Archives, and Special 
Collections.  The AUL must demonstrate a strong commitment to the collaborative 
development and implementation of innovative digital and web initiatives and 
services that respond adroitly to all users' evolving needs as researchers and 
scholars.
OSU Libraries has nearly 2 million volumes and vast digital resources including 
ScholarsArchive@OSU (the 4th ranked single-university repository in the U.S.), 
internationally recognized digital collections like the Oregon Explorer natural 
resources digital library, and an agile development environment which has 
produced the LibraryFind(tm) metasearch application, the Library à la Carte 
Content Management System and other digital initiatives to serve the 
university's 24,000 students, faculty scholars and researchers, and the public. 
 OSU Libraries is a member of the Orbis Cascade Alliance of Northwest
universities and colleges, which has a total of more than 9 million holdings.
The OSU Libraries' Special Collections include the Ava Helen and Linus Pauling 
Papers as the cornerstone for collections on the history of science and 
technology in the 20th century.  University Archives collections record the 
history of OSU and include the Oregon Multicultural Archives, which documents 
the lives and activities of ethnic minority communities in Oregon; and 
extensive collections pertaining to natural resources in Oregon and the 
Northwest.

Required Qualifications:


 *   MLS from an ALA-accredited library program or foreign equivalent.
 *   Minimum of seven years increasing responsibility in an academic or 
research library.
 *   Applied knowledge of the principles of library management and organization.
 *   Experience with budget operations and strategic planning.
 *   Knowledge of new information technologies, evolving models of scholarship, 
and the presentation of services in the Web environment and the ability to 
articulate how these influence teaching, learning and scholarship.
 *   Strong record of scholarly publication, research and national 
participation in professional societies suitable for appointment as associate 
professor
 *   Demonstrated commitment to service to all constituencies.
 *   A record of accomplishment in dealing with change and mentoring and 
coaching staff at all levels including successful experience supporting 
tenure-track faculty.
 *   Experience in working with state and/or regional consortia.
 *   Excellent analytical, interpersonal, oral and written communication skills.
 *   A demonstrable commitment to promoting and enhancing diversity.
 *   A demonstrated commitment to working collaboratively.
 *   Experience managing and administering digital library initiatives and 
services.
 *   Experience with assessment and evaluation techniques, especially as 
applied to programs and services relevant to position responsibilities

Preferred Qualifications:

 *   Additional graduate degree.
 *   Experience working with special collections and archives.
 *   Experience participating in a library fundraising and development program, 
engaging with new and ongoing donors and providing stewardship information to 
major donors

Environment:
Oregon State is a leading research university located in one of the safest, 
smartest, greenest small cities in the nation: 
http://oregonstate.edu/main/about. Situated 90 miles south of Portland, and an 
hour from the Cascades or the Pacific Coast, Corvallis is the perfect home base 
for exploring Oregon's natural wonders.  The university has an institution-wide 
commitment to diversity, multiculturalism and community and actively recruits 
and retains a diverse workforce and student body that includes members of 
historically underrepresented groups.

Employment Conditions:
Full-time, 12 month, annual tenure track appointment at the rank of Associate 
Professor.  Salary is commensurate 

Re: [CODE4LIB] z39.50 and write operations

2011-06-03 Thread Reese, Terry
Yes, but only if the server you are using supports the Z39.50 extended
services (the Update service).  However, few commercial ILS systems seem to
support it by default.

Tr

*
Terry Reese
Gray Family Chair for
Innovative Library Services
121 Valley Library
Corvallis,  OR 97331
phone:  541.737.6384
*

-Original Message-
From: Eric Lease Morgan
Sent: Friday, June 03, 2011 12:16 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] z39.50 and write operations


Does Z39.50 support write operations?

We here at Notre Dame may be working with a vendor in the near future who will 
be reading our MARC bibliographic records via Z39.50. I have also been told, 
not necessarily by the vendor, that these same records will be updated and 
reinserted back into our catalog by the vendor. I am quite familiar with 
Z39.50's ability to search and download content, but I am not familiar with its 
ability to write back to the server. Is this possible?

--
Eric Lease Morgan
University of Notre Dame


Re: [CODE4LIB] is this valid marc ?

2011-05-19 Thread Reese, Terry
Jonathan, 

Karen is correct -- CR/LF are invalid characters within a MARC record.  This
has nothing to do with whether the character is valid in the character set --
the format itself doesn't allow it.

--TR

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of 
Jonathan Rochkind
Sent: Thursday, May 19, 2011 11:29 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] is this valid marc ?

I wonder if it depends on whether your record is in Marc8 or UTF-8, if I'm
reading Karen right that CR/LF aren't in the Marc8 character set.
They're certainly in UTF-8!  And a Marc record can be in UTF-8.

On 5/19/2011 2:27 PM, Jonathan Rochkind wrote:
 Is it really true that newline characters are not allowed in a marc 
 value?  I thought they were, not with any special meaning, just as 
 ordinary data.  If they're not, that's useful to know, so I don't put 
 any there!

 I'd ask for a reference to the standard that says this, but I suspect
 it's going to be some impenetrable implication of a side effect of a
 subtle adjective either way.

 On 5/19/2011 2:19 PM, Karen Coyle wrote:
 Quoting Andreas Orphanides andreas_orphani...@ncsu.edu:


 Anyway, I think having these two parts of the same URL data on 
 separate lines is definitely Not Right, but I am not sure if it adds 
 up to invalid MARC.

 Exactly. The CR and LF characters are NOT defined as valid in the
 MARC character set and should not be used. In fact, in MARC there is
 no concept of lines, only variable length strings (usually up to
 9,999 char).

 kc


 -dre.

 [1] http://www.loc.gov/marc/bibliographic/bd856.html
 [2] I am not a cataloger. Don't hurt me.
 [3] I am not an expert on MARC ingest or on ruby-marc. I could be 
 wrong.

 On 5/19/2011 12:37 PM, James Lecard wrote:
 I'm using the ruby-marc parser (v0.4.2) to parse some MARC files
 I get from a partner.

 The 856 field is split over 2 lines, causing the ruby library to
 ignore it (I've patched it to overcome this issue), but I want to
 know if this kind of MARC is valid?

 =LDR  00638nam  2200181uu 4500
 =001  cla-MldNA01
 =008  080101s2008\\\|fre||
 =040  \\$aMy Provider
 =041  0\$afre
 =245  10$aThis Subject
 =260  \\$aParis$bJ. Doe$c2008
 =490  \\$aSome topic
 =650  1\$aNarratif, Autre forme
 =655  \7$abook$2lcsh
 =752  \\$aA Place on earth
 =776  \\$dParis: John Doe and Cie, 1973
 =856  \2$qtext/html
 =856
 \\$uhttp://www.this-link-will-not-be-retrieved-by-ruby-marc-library

 Thanks,

 James L.






Re: [CODE4LIB] is this valid marc ?

2011-05-19 Thread Reese, Terry
It's been a while since I looked at the ISO spec (which I still can't believe I
had to buy to read) -- but you can certainly infer from the legal characters
laid out by LC.  In reality, only a handful of unprintable characters are
technically allowed in a MARC record -- but you have to remember that when MARC
was created, it was for block reading -- and generally, early (and current)
readers stop on hard breaks.
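
A quick scan for offending records is easy enough (a sketch; assumes ISO 2709
files with the usual 0x1D record terminator):

with open("records.mrc", "rb") as f:
    raw = f.read()

for i, rec in enumerate(raw.split(b"\x1d")):
    if b"\r" in rec or b"\n" in rec:       # CR/LF embedded in record data
        print(f"record {i}: contains CR/LF")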

--TR

 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Jonathan Rochkind
 Sent: Thursday, May 19, 2011 11:49 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] is this valid marc ?
 
 On 5/19/2011 2:33 PM, Reese, Terry wrote:
  Jonathan,
 
  Karen is correct -- CR/LF are invalid characters within a MARC record.  This
 has nothing to do if the character is valid in the set -- the format itself 
 doesn't
 allow it.
 
  I'm curious where in the spec it says this -- of course, it's an intellectual
  exercise at this point, because even if the spec says one thing, it doesn't
  matter if everyone (including tool-writers) has always understood it
  differently. (This is a problem for me with lots of library 'standards'
  including MARC. Oh yeah, it might APPEAR to say/allow/prohibit that, but
  don't believe it, 'everyone' has always understood it differently. Or two
  parts of a spec which contradict each other.)
 
 In the glossary here:
 http://www.loc.gov/marc/specifications/speccharintro.html
 
  It does say "Consequently, code points less than 80 (hex) have the same
  meaning in both of the encodings used in MARC 21 and may be referred to as
  ASCII in either environment." Which could be interpreted to include control
  chars such as CR and LF. (Thanks Dan Scott). Of course, the glossary section
 may not actually be an operative part of the standard, or it may not mean
 what it seems to mean, or everyone may have always acted as if it meant
 something different. Welcome to MARC.
 
 But I'm not succesfully finding anything else that says one way or another on
 the legality. Most of the ascii control chars do seem to be missing from Marc8
  (whether by design or accident), but that doesn't necessarily mean they're
 illegal in a MARC record using some other (legal for MARC) encoding.
 
 But I believe Terry that it's not allowed (I believe Terry about just about
  everything).  It's just really an intellectual exercise in the difficulty of
  finding answers in the MARC spec at the moment.


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Reese, Terry
Actually, you can have records that are MARC21 coming out of vendor databases
(which sometimes embed control characters into the leader) and still be valid.
Once you stop looking at just your ILS or OCLC, you probably wouldn't be
surprised to know that records start looking very different.

--TR



Terry Reese, Associate Professor
Gray Family Chair 
for Innovative Library Services
121 Valley Libraries
Corvallis, Or 97331
tel: 541.737.6384




 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Jonathan Rochkind
 Sent: Wednesday, April 06, 2011 9:44 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] MARC magic for file
 
 Can't you have a legal MARC file that does NOT have 4500 in those
 leader positions?  It's just not legal Marc21, right?   Other marc
 formats may specify or even allow flexibility in the things these bytes
 specify:
 
 * Length of the length-of-field portion
 * Number of characters in the starting-character-position portion of a
 Directory entry
 * Number of characters in the implementation-defined portion of a Directory
 entry
 
  Or, um, 23, which I guess is left to the specific Marc implementation (ie,
  Marc21 is one such) to use for its own purposes.
 
 I have no idea how that should inform the 'marc magic'.
 
 Is mime-type application/marc defined as specifically Marc21, or as any
 Marc?
 
 Jonathan
 
 On 4/6/2011 12:28 PM, Ford, Kevin wrote:
  Well, this brings us right up against the issue of files that adhere to 
  their
 specifications versus forgiving applications.  Think of browsers and HTML.
 Suffice it to say, MARC applications are quite likely to be forgiving of 
 leader
 positions 20-23.  In my non-conforming MARC file and in Bill's, the leader
 positions 20-21 (45) seemed constant, but things could fall apart for
 positions 22-23.  So...
 
  I present the following (in-line and attached, to preserve tabs) in an
 attempt to straddle the two sides of this issue: applications forgiving of 
 non-
 conforming files.  Should the two characters following 45 (at position 20)
 *not* be 00, then the identification will be noted as non-conforming.  We
 could classify this as reasonable identification but hardly ironclad (indeed,
 simply checking to confirm that part of the first 24 positions match the
 specification hardly constitutes a robust identification, but it's something).
 
  It will also give you a mimetype too, now.
 
  Would any like testing it out more fully on their own files?
 
 
  #
  # MARC 21 Magic  (Third cut)
 
  # Set at position 0
  0       byte    x
 
  # leader position 20-21 must be 45
  20 string  45
  # leader starts with 5 digits, followed by codes specific to MARC format
  0 regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z]  MARC Bibliographic
  !:mime  application/marc
  0 regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority
  !:mime  application/marc
  0 regex/1 (^[0-9]{5})[cdn][uvxy]  MARC Holdings
  !:mime  application/marc
  0 regex/1 (^[0-9]{5})[acdn][w]MARC Classification
  !:mime  application/marc
  0 regex/1 (^[0-9]{5})[cdn][q] MARC Community
  !:mime  application/marc
 
  # leader position 22-23, should be 00 but is it?
  0 regex/1 (^.{21})([^0]{2})   (non-conforming)
  !:mime  application/marc
 
 
  If this works, I'll see about submitting this copy.  Thanks to all your 
  efforts
 already.
 
  Warmly,
 
  Kevin
 
  --
  Library of Congress
  Network Development and MARC Standards Office
 
 
 
 
 
  
  From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Simon
  Spero [s...@unc.edu]
  Sent: Sunday, April 03, 2011 14:01
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] MARC magic for file
 
  I am pretty sure that the marc4j standard reader ignores them; the
  tolerant reader definitely does. Otherwise JHU might have about two
  parseable records based on the mangled leaders that J-Rock  gets stuck
  with :-)
 
  An analysis of the ~7M LC bib records from the scriblio.net data files
  (~ Dec 2006) indicated that the leader has less than 8 bits of
  information in it (Shannon-Weaver definition). This excludes the
  initial length value, which is redundant given the end-of-record marker.
 
 
  The LC V'GER adds a pseudo tag 000 to its HTML view of the MARC leader.
  The final characters of the leader are 450.
 
  Also, I object to the phrase "decent MARC tool".  Any tool capable of
  dealing with MARC as it exists cannot afford the luxury of decency :-)
 
  [ HA: A clear conscience?
BW: Yes, Sir Humphrey.
HA: When did you acquire this taste for luxuries?]
 
  Simon
 
  On Fri, Apr 1, 2011 at 5:16 AM, Owen Stephens o...@ostephens.com
 wrote:
 
  I'm sure any decent MARC tool can deal with them, since decent MARC
  tools are certainly going to be forgiving enough to deal with four
  characters that 

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Reese, Terry
Actually -- I'd disagree, because that is a very narrow view of the
specification.  When validating MARC, I'd take the approach of validating
structure (which allows you to then read any MARC format) -- then use a
separate process for validating the content of fields, which in my opinion is
more open to interpretation based on system usage of the data.  For example, 22
and 23 are undefined values that local systems may very well have a practical
need to define and use, given that there are only so many values in the leader.
This is why I sometimes see additional values in position 09 (which should be
'a' or blank) to define different character set types, or additional elements
added to other fields.  If I want to validate the content of those fields, I'd
validate it through a different process -- but I separate that process from the
validation of the structure, because the two are not exclusive.

--TR

 -Original Message-
 From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
 Sent: Wednesday, April 06, 2011 9:59 AM
 To: Code for Libraries
 Cc: Reese, Terry
 Subject: Re: [CODE4LIB] MARC magic for file
 
 I'm not sure what you mean, Terry.  Maybe we have different understandings
 of "valid".
 
 If leader bytes 20-23 are not 4500, I suggest that is _by definition_ not a
 valid Marc21 file. It violates the Marc21 specification.
 
 Now, they may still be _usable_, by software that ignores these bytes
 anyway or works around them. We definitely have a lot of software that
 does that.
 
 Which can end up causing problems that remind me of very analogous
 problems caused by the early days of web browsers that felt like being
 'tolerant' of bad data. "My HTML works in every web browser BUT this one,
 why not? Oh, because that's the only one that actually followed the
 standard. Oops."
 
 I actually ran into an example of that problem with this exact issue.
 MOST software just ignores MARC leader bytes 20-23 and assumes the
 semantics of 4500 -- the only legal semantics for Marc21.  But Marc4j
 actually _respected_ them; apparently the author thought that some MARC
 in the wild might intentionally set different bytes here (no idea if
 that's true or not). So if the leader bytes 20-23 were invalid
 (according to the spec), Marc4j would suddenly decide that the
 length-of-field portion was NOT 4, but would actually BELIEVE whatever
 was in leader byte 20, causing the record to be parsed improperly.  And
 I had records like that coming out of my ILS (not even a vendor
 database). That was an unfun couple of days of debugging to figure out
 what was going on.
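
One defensive workaround -- a sketch only, with hypothetical file names, and
not what Marc4j itself does -- is to force leader bytes 20-23 back to 4500
before a strict parser ever sees the records:

    # Rewrite leader positions 20-23 to "4500" (the only legal Marc21
    # values) before re-serializing. File names are hypothetical.
    use strict;
    use warnings;
    use MARC::Batch;

    my $batch = MARC::Batch->new('USMARC', 'export.mrc');
    open my $out, '>:raw', 'fixed.mrc' or die "open: $!";
    while (my $record = $batch->next) {
        my $leader = $record->leader;
        substr($leader, 20, 4) = '4500';
        $record->leader($leader);
        print {$out} $record->as_usmarc;
    }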
 
 On 4/6/2011 12:52 PM, Reese, Terry wrote:
  Actually, you can have records that are MARC21 coming out of vendor
 databases (which sometimes embed control characters into the leader) and still
 be valid.  Once you stop looking at just your ILS or OCLC, you probably
 wouldn't be surprised to know that records start looking very different.
 
  --TR
 
 
  
  Terry Reese, Associate Professor
  Gray Family Chair
  for Innovative Library Services
  121 Valley Libraries
  Corvallis, Or 97331
  tel: 541.737.6384
  
 
 
 
  -Original Message-
  From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf
  Of Jonathan Rochkind
  Sent: Wednesday, April 06, 2011 9:44 AM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] MARC magic for file
 
  Can't you have a legal MARC file that does NOT have 4500 in those
  leader positions?  It's just not legal Marc21, right?   Other marc
  formats may specify or even allow flexibility in the things these
  bytes
  specify:
 
  * Length of the length-of-field portion
  * Number of characters in the starting-character-position portion of
  a Directory entry
  * Number of characters in the implementation-defined portion of a
  Directory entry
 
  Or, um, 23, which I guess is left to the specific MARC
  implementation (i.e.,
  Marc21 is one such) to use for its own purposes.
 
  I have no idea how that should inform the 'marc magic'.
 
  Is mime-type application/marc defined as specifically Marc21, or as
  any Marc?
 
  Jonathan
 
  On 4/6/2011 12:28 PM, Ford, Kevin wrote:
  Well, this brings us right up against the issue of files that adhere
  to their specifications versus forgiving applications.  Think of
  browsers and HTML.  Suffice it to say, MARC applications are quite
  likely to be forgiving of leader positions 20-23.  In my
  non-conforming MARC file and in Bill's, the leader positions 20-21
  (45) seemed constant, but things could fall apart for positions
  22-23.  So...
  I present the following (in-line and attached, to preserve tabs) in
  an attempt to straddle the two sides of this issue: applications
  forgiving of non-conforming files.  Should the two characters
  following 45 (at position 20) *not* be 00, then the identification
  will be noted as non-conforming.  We could classify this as
  reasonable identification but hardly ironclad

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Reese, Terry
I'm honestly not familiar with magic.  I can tell you that in MarcEdit, the way 
the process works is that there is a very generic function that reads the 
structure of the data without trusting the information in the leader (since I 
find this data very unreliable).  Then, if users want to apply a set of rules 
to the validation, I apply those as a secondary process.  If you are looking 
to validate specific content within a record, then what you want to do in this 
function may be appropriate -- though you'll find some local systems will 
consistently fail the process.
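
A bare-bones version of that structure-first approach -- a sketch, not
MarcEdit's actual code, with a hypothetical file name -- reads records by the
record terminator and walks the directory directly, hard-coding the 4/5
directory widths instead of trusting leader bytes 20-21:

    # Structure-first reading: split on the record terminator (0x1D), take
    # the base address from leader 12-16, and walk the 12-byte directory
    # entries (tag 3, length 4, offset 5). File name is hypothetical.
    use strict;
    use warnings;

    open my $fh, '<:raw', 'records.mrc' or die "open: $!";
    local $/ = "\x1d";                    # record terminator
    while (my $raw = <$fh>) {
        next if length($raw) < 24;
        my $base = substr($raw, 12, 5);   # base address of data
        my $dir  = substr($raw, 24, $base - 24 - 1);  # directory, minus its terminator
        while ($dir =~ /\G(.{3})(.{4})(.{5})/gs) {
            my ($tag, $len, $off) = ($1, $2, $3);
            my $field = substr($raw, $base + $off, $len - 1);  # drop field terminator
            print "$tag: $field\n" if $tag eq '245';
        }
    }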

--tr


From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of William Denton 
[w...@pobox.com]
Sent: Wednesday, April 06, 2011 10:29 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC magic for file

On 6 April 2011, Reese, Terry wrote:

 Actually -- I'd disagree because that is a very narrow view of the
 specification.  When validating MARC, I'd take the approach to validate
 structure (which allows you to then read any MARC format) -- then use a
 separate process for validating content of fields, which in my opinion,
 is more open to interpretation based on system usage of the data.

What do you think is the best way to recognize MARC files (up to some
level of validity, given all the MARC you've seen and parsed) in a way
that could be made to work within magic's rule format?

Bill
--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org


Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Reese, Terry
I'd echo Jonathan's question -- the 0xC2 code is the sound recording copyright 
character in MARC-8.  I'd guess the file isn't in UTF-8.

--TR

 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Jonathan Rochkind
 Sent: Wednesday, April 06, 2011 1:28 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
 
 I am not familiar with that Perl module. But I'm more familiar than I'd want
 to be with char encoding in MARC.
 
 I don't recognize the bytes 0xC2 (there are some bytes I became pathetically
 familiar with in past debugging, but I've forgotten 'em), but the first
 things to look at:
 
 1. Is your Marc file encoded in Marc8 or UTF-8?  I'm betting Marc8.
 Theoretically there is a Marc leader byte that tells you whether it's
 Marc8 or UTF-8, but the leader byte is often wrong in real world records.  Is 
 it
 wrong?
 
 2. Does Perl MARC::Batch have a function to convert from Marc8 to
 UTF-8?   If so, how does it decide whether to convert? Is it trying to
 do that?  Is it assuming that the leader byte in the record accurately
 identifies the encoding, and if so, is the leader byte wrong?   Is it
 trying to convert from Marc8 to UTF-8, when the source was UTF-8 in the
 first place?  Or is it assuming the source was UTF-8 in the first place,
 when in fact it was Marc8?
 
 Not the answer you wanted, maybe someone else will have that. Debugging
 char encoding is hands down the most annoying kind of debugging I ever do.
 
 On 4/6/2011 4:13 PM, Eric Lease Morgan wrote:
  Ack! While using the venerable Perl MARC::Batch module I get the
 following error while trying to read a MARC record:
 
 utf8 \xC2 does not map to Unicode
 
  This is a real pain, and I'm hoping someone here can help me either: 1) trap
 this error allowing me to move on, or 2) figure out how to open the file
 correctly.
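
One belt-and-suspenders recipe that usually gets past this -- a sketch with a
hypothetical file name, not a guaranteed fix -- is to turn off strict mode and
wrap next() in eval so one bad record doesn't kill the whole run:

    # Skip records that blow up in MARC::Batch instead of dying on the
    # first bad one. File name is hypothetical.
    use strict;
    use warnings;
    use MARC::Batch;

    my $batch = MARC::Batch->new('USMARC', 'records.mrc');
    $batch->strict_off;     # don't croak on structural complaints
    $batch->warnings_off;   # quiet the warnings, too
    my ($count, $skipped) = (0, 0);
    while (1) {
        my $record = eval { $batch->next };
        if ($@) { warn "skipping bad record: $@"; $skipped++; next; }
        last unless defined $record;
        $count++;
    }
    print "read $count records, skipped $skipped\n";

And if the records turn out to be MARC-8 rather than UTF-8, as Terry suspects,
MARC::Charset's marc8_to_utf8() is the usual conversion route.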
 


Re: [CODE4LIB] marcxml

2010-11-11 Thread Reese, Terry
Yes -- that's right.  There is a zip file with install instructions for any 
non-Windows-based system for which a Mono port is present.

--TR

 -Original Message-
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Joel Marchesoni
 Sent: Thursday, November 11, 2010 8:40 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] marcxml
 
 There actually is a version of MARCEdit for Linux now. I think
 (although I can't remember and can't find it on the site) that it
 relies on Mono.
 
 MARCEdit download page:
 http://people.oregonstate.edu/~reeset/marcedit/html/downloads.html
 
 Joel
 
 -Original Message-
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 J.D.Gravestock
 Sent: Thursday, November 11, 2010 6:26 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] marcxml
 
 I'd be interested to know if anyone is using a good MARCXML to MARC
 converter (other than MarcEdit, i.e. non-Windows).  I've tried the Perl
 module MARC::XML but I'm having a few problems with the conversion which I
 can't replicate in MarcEdit. Are there any that I've missed?
 
 
 Jill
 
 **
 Jill Gravestock
 Open University Library
 Milton Keynes
 
 
 
 
 --
 The Open University is incorporated by Royal Charter (RC 000391), an
 exempt charity in England & Wales and a charity registered in Scotland
 (SC 038302).
 
 
 --
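
For the record, the stock CPAN route is MARC::File::XML, which registers an
'XML' input type for MARC::Batch; a minimal sketch, with hypothetical file
names:

    # Convert MARCXML to binary MARC with MARC::File::XML. File names
    # are hypothetical.
    use strict;
    use warnings;
    use MARC::Batch;
    use MARC::File::XML ( BinaryEncoding => 'utf8' );

    my $batch = MARC::Batch->new('XML', 'records.xml');
    open my $out, '>:raw', 'records.mrc' or die "open: $!";
    while (my $record = $batch->next) {
        print {$out} $record->as_usmarc;
    }
    close $out;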


Re: [CODE4LIB] Copy Cataloging MARC record manipulation

2010-08-19 Thread Reese, Terry
Andy,

Since I write MarcEdit, maybe I can help.  If you can give me an idea of what 
you are up to, I'll see if it's something that can be dealt with.

Tr


Terry Reese
Gray Family Chair for Innovative
Library Services
Oregon State University Libraries
Corvallis, OR  97331
tel:  541.737.6384
web: http://people.oregonstate.edu/~reeset/


On Aug 19, 2010, at 6:16 PM, Andy Kelly a.m.ke...@gmail.com wrote:

 Greetings all,
 I am in a bit of a fix. I am working to set my library up with a more
 effective copy-cataloging workflow and was looking for some software
 suggestions.
 I'm more or less trapped on Windows XP and have so far been running the
 Mercury Z39.50 client with some success. My search would end here if
 exporting one. record. at. a. time. wasn't so painful.
 I've been evaluating MarcEdit and its associated Z39.50 client. I've found
 it to be slow, buggy and always trapped in windows of fixed sizes. It can
 also only search one Z39.50 server at a time, so it replaces one bottleneck
 with another. I get the impression I'm sort of in the Dark Ages here in that
 we're not OCLC copy-cataloging subscribers, and I can't seem to
 convince my superiors that the service is worth making room for in the
 budget, though perhaps this is a more common situation than I'm aware of.
 
 Ideally: I feed in a txt file or CSV of ISBNs and I get out one big file of
 MARC records to feed my [ancient, fussy] OPAC.
 
 This might be one of those "...why don't you do it with a Perl script?"
 problems that might get me to really dive into my copy of Introducing Perl.
 (I've looked at the ZOOM Perl bindings and the MARC module on CPAN; both look
 promising but far beyond my current limited abilities, and likely even
 further beyond my boss's, future replacements', and/or student workers'
 ability to maintain or use.)
 
 Thanks for your help & suggestions.
 ~Andy
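
For whoever picks this up, a sketch of the Perl route using the ZOOM bindings
(Net::Z3950::ZOOM) against the Library of Congress's public Z39.50 server --
file names are hypothetical, and the only error handling is skipping misses:

    # Feed in a text file of ISBNs, one per line; write any hits out as
    # one big file of MARC records. File names are hypothetical.
    use strict;
    use warnings;
    use ZOOM;

    my $conn = ZOOM::Connection->new('z3950.loc.gov', 7090);
    $conn->option(databaseName          => 'Voyager');
    $conn->option(preferredRecordSyntax => 'usmarc');

    open my $in,  '<',     'isbns.txt' or die "open: $!";
    open my $out, '>:raw', 'found.mrc' or die "open: $!";
    while (my $isbn = <$in>) {
        chomp $isbn;
        next unless $isbn =~ /\S/;
        # Bib-1 use attribute 7 = ISBN
        my $rs = eval { $conn->search_pqf(qq(\@attr 1=7 "$isbn")) };
        if ($@ or !$rs or $rs->size == 0) {
            warn "no hit for $isbn\n";
            next;
        }
        print {$out} $rs->record(0)->raw;
        $rs->destroy;
    }
    $conn->destroy;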


[CODE4LIB] Pacific Northwest Code4Lib chapter and meeting

2009-04-02 Thread Reese, Terry
FYI for the larger group.  Since many members in the PNW simply cannot
travel to the larger C4L meeting due to budgetary constraints (this year,
and very likely the next) -- we will be starting up a PNW local
chapter and hosting a one-day C4L meeting for those in the area who are
interested but otherwise unable to attend the annual C4L
meeting.  Info can be found at:
http://groups.google.com/group/pnwcode4lib?hl=en.  Plus, it will give
the PNW a group that can start crafting a plan to bring the C4L
conference back to its PNW home. :)

 

--TR

 

 

***

Terry Reese

The Gray Family Chair for Innovative Library Services

Oregon State University Libraries

Corvallis, OR  97331

tel: 541-737-6384

email: terry.re...@oregonstate.edu

http: http://oregonstate.edu/~reeset

*** 

 

 


[CODE4LIB] FW: Please visit RDA Test Website

2009-03-27 Thread Reese, Terry
Posted on behalf of Dianne McCutcheon
 
*
Terry Reese
The Gray Family Chair for Innovative Library Services
Oregon State University Libraries
Corvallis, OR  97331
tel: 541-737-6384
email: terry.re...@oregonstate.edu
http: http://oregonstate.edu/~reeset
*
*

The US National Libraries RDA Test Steering Committee has launched a Website 
for the RDA test project, at URL

http://www.loc.gov/bibliographic-future/rda/

The site includes a link to a fill-in PDF application form that you can use to 
let us know if you're interested in being selected as a test partner.

The Test Steering Committee received excellent comments about the project after 
the RDA Test Planning Forum at ALA Midwinter in Denver.  As a result of this 
feedback, we realized that we needed to ask for more precise information from 
the potential test participants.  So we revised the application form and made 
it available on the RDA Test Planning Website.  Please complete and return the 
form, even if you submitted an expression of interest earlier.

The Website also has links to a proposed timeline and to the methodology that 
the Steering Committee plans to use for the testing.  We'll update the site 
with additional information as we develop a complete test protocol.  

Thank you very much for your interest in the US National Libraries RDA Test 
project.  We look forward to hearing from you.  As the application form states, 
we're requesting that anyone interested in participating as a test partner 
return the PDF application, via email, by April 13 to Susan Morris.  The email 
link in the form will return it to Susan.  Please get in touch with her if you 
have any questions or if there is any problem with the PDF.  

Susan R. Morris

Special Assistant to the Director for Acquisitions and Bibliographic Access

Library of Congress 

voice: 202-707-6073

fax: 202-252-3220

For the US National Libraries RDA Test Steering Committee: co-chairs Chris 
Cole, Dianne McCutcheon, and Beacher Wiggins  

 


Re: [CODE4LIB] Zotero under attack

2008-09-28 Thread Reese, Terry
This seems like a real grey area.  I can see Thomson Scientific 
putting up a fuss when using ENS files generated by the creator of 
EndNote.  But ENS files can -- and have -- be created by just about 
anyone (librarians, journal publishers, researchers) and published on 
the open web.  
 
I'm not sure that's what they are saying.  EndNote does come with .ens files 
that they create (I believe that was the case the last time I looked at the 
software), managed and provided as part of their application.  They certainly 
can claim rights to those (this isn't really a gray area) -- and unless the 
Zotero software is able to distinguish user-generated files from files 
distributed as part of the EndNote application, it could be problematic.
 
--TR
 
***
Terry Reese
Cataloger for Networked Resources
Digital Production Unit Head
Oregon State University Libraries
Corvallis, OR  97331
tel: 541-737-6384
email: [EMAIL PROTECTED]
http: http://oregonstate.edu/~reeset
***



From: Code for Libraries on behalf of Peter Murray
Sent: Sun 9/28/2008 5:46 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Zotero under attack




I've posted some analysis and plenty of links to critical bits at 
http://dltj.org/article/endnote-zotero-lawsuit/

Some other thoughts...

On Sep 26, 2008, at 4:01 PM, Reese, Terry wrote:
 While reverse engineering the .ens
 style files really isn't that big of a deal (this kind of reverse
 engineering is generally legally permitted), utilizing the collected
 knowledge-base from an End-note application is.  I've run into this in
 the past with other software that I've worked on -- there is a good 
 deal
 of legal tiptoeing that often needs to be done when you are building
 software that will essentially bird dog another (proprietary)
 application's knowledge-base.


This seems like a real grey area.  I can see Thomson Scientific 
putting up a fuss when using ENS files generated by the creator of 
EndNote.  But ENS files can -- and have -- be created by just about 
anyone (librarians, journal publishers, researchers) and published on 
the open web.  I don't see anything in the license agreement or argued 
elsewhere that says Thomson Scientific has rights over these 
works (the citation definition files) created and published by 
others.  That would seem akin to Microsoft claiming rights over 
documents written in Word.


Peter
- --
Peter Murrayhttp://www.pandc.org/peter/work/
Assistant Director, New Service Development  tel:+1-614-728-3600;ext=338
OhioLINK: the Ohio Library and Information NetworkColumbus, Ohio
The Disruptive Library Technology Jesterhttp://dltj.org/
Attrib-Noncomm-Share   http://creativecommons.org/licenses/by-nc-sa/2.5/




Re: [CODE4LIB] Zotero under attack

2008-09-26 Thread Reese, Terry
Hopefully, this quote from the article:

"A significant and highly touted feature of the new beta
version of Zotero, however, is its ability to convert - in direct
violation of the License Agreement - Thomson's 3,500 plus proprietary
.ens style files within the EndNote Software into free, open source,
easily distributable Zotero .csl files"

isn't quite this straightforward.  While reverse engineering the .ens
style files really isn't that big of a deal (this kind of reverse
engineering is generally legally permitted), utilizing the collected
knowledge-base from an End-note application is.  I've run into this in
the past with other software that I've worked on -- there is a good deal
of legal tiptoeing that often needs to be done when you are building
software that will essentially bird dog another (proprietary)
application's knowledge-base. 

--TR

 -Original Message-
 From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf
Of
 wally grotophorst
 Sent: Friday, September 26, 2008 12:09 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] Zotero under attack
 

http://www.courthousenews.com/2008/09/17/Reuters_Says_George_Mason_University_Is_Handing_Out_Its_Proprietary_Software.htm
 
 I guess stuff like this is what gives me that anti-corporate bias...


Re: [CODE4LIB] what's the best way to get from Portland to San Francisco on Feb 28?

2008-02-20 Thread Reese, Terry
You'll want to fly.  On the West Coast, taking the train is a bit of a crap 
shoot, and I wouldn't advise it unless you had a day between when you are 
supposed to arrive and when you need to arrive.  The few times I've taken 
Amtrak on the West Coast between Seattle and Los Angeles, I've never been on 
time.  I've been anywhere from 5 hours to one day late, depending on the 
distance traveled.  In fact, given my past experience, if I wasn't going to 
fly -- I would drive.  It will take you approximately 12-13 hours to drive 
down I-5 from Portland to San Francisco.  By train, almost twice as long.
 
--TR
 
***
Terry Reese
Cataloger for Networked Resources
Digital Production Unit Head
Oregon State University Libraries
Corvallis, OR  97331
tel: 541-737-6384
email: [EMAIL PROTECTED]
http: http://oregonstate.edu/~reeset
***



From: Code for Libraries on behalf of Elizabeth Sadler
Sent: Wed 2/20/2008 6:59 PM
To: CODE4LIB@listserv.nd.edu
Subject: [CODE4LIB] what's the best way to get from Portland to San Francisco 
on Feb 28?



Dear Code4Lib folks,

Can some of you west-coasters advise me on the best (read: cheapest
and most fun) way to get from Portland to San Francisco on February
28? Taking a plane is my last resort. My first choice would be
hitching a ride with any conference attendees who would be going that
way anyway, and I also thought about taking the train but Amtrak has
stymied me. Greyhound would take much too long, I imagine.

Anyway, I thought it was worth a question. Any suggestions from the
community?

Bess

Elizabeth (Bess) Sadler
Research and Development Librarian
Digital Scholarship Services (DSS)
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904

[EMAIL PROTECTED]
(434) 243-2305


Re: [CODE4LIB] getting Worldcat records

2008-02-07 Thread Reese, Terry
Since these are your libraries' records, you can certainly download them
again from OCLC.  I've also known libraries in the past that have been
able to have OCLC generate a subset of records from their database --
though in these cases, this has always involved a cost to purchase the
records.
have OCLC do it, you would likely need a list of all the OCLC numbers
that you are interested in.  With that list, you could easily batch
export the data again from Worldcat using Connexion.

--TR



 -Original Message-
 From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf
Of
 Alberto Accomazzi
 Sent: Thursday, February 07, 2008 10:35 AM
 To: CODE4LIB@listserv.nd.edu
 Subject: [CODE4LIB] getting Worldcat records

 Our project maintains a database of bibliographic metadata for all
 things in astronomy and most of physics.  We'd like to add records for
 books that have been recently added to our library and to correlate
 existing records with the library holdings.  Sounds easy enough, but
 because of the intricacies of Harvard libraries administration we
 haven't been able to get a dump of the records, much less a feed.

 The recent emails about OCLC worldcat records made me wonder if we
 could
 get the equivalent data from them (since our library subscribes to
 them).  Essentially what I'd like is a dump of all QB and QC records
in
 OCLC entered by Harvard, so we can index them and then point to the
 library record in OCLC.  Is this (a) legal, (b) feasible, (c) easy?  I
 assume the answer to (a) and (b) is yes, since we have our library's
 support.  If not, are there alternatives?  I learned about openlibrary
 only yesterday, so I haven't had a chance to explore what's in it
 yet...

 Thanks,

 -- Alberto


Re: [CODE4LIB] Records for Open Library

2008-02-06 Thread Reese, Terry
  Isn't sharing such records a no-no?
No, OCLC's guidelines for transfer 
(http://www.oclc.org/support/documentation/worldcat/records/guidelines/default.htm)
 specifically give unrestricted transfer rights to libraries and non-commercial 
entities.  The Open Library is both.  It's a registered library in California 
and a non-profit.  So in either situation, it's not a problem.
 
--TR
 
***
Terry Reese
Cataloger for Networked Resources
Digital Production Unit Head
Oregon State University Libraries
Corvallis, OR  97331
tel: 541-737-6384
email: [EMAIL PROTECTED]
http: http://oregonstate.edu/~reeset
***



From: Code for Libraries on behalf of Peter Murray
Sent: Wed 2/6/2008 2:50 PM
To: CODE4LIB@listserv.nd.edu
Subject: Re: [CODE4LIB] Records for Open Library




On Feb 5, 2008, at 12:11 PM, K.G. Schneider wrote:
 Has your library considered contributing records to Open Library (
 http://www.openlibrary.org/ )? If so I'd like to hear from you on or
 off
 list.


How would that work?  Most of the records in OhioLINK are probably
derived from OCLC Worldcat.  Isn't sharing such records a no-no?


Peter
- --
Peter Murrayhttp://www.pandc.org/peter/work/
Assistant Director, New Service Development  tel:+1-614-728-3600;ext=338
OhioLINK: the Ohio Library and Information NetworkColumbus, Ohio
The Disruptive Library Technology Jesterhttp://dltj.org/
Attrib-Noncomm-Share   http://creativecommons.org/licenses/by-nc-sa/2.5/




Re: [CODE4LIB] low-cost software for prison libraries?

2008-01-30 Thread Reese, Terry
I'd suggest Koha -- but if they are looking for something simple and low-cost, 
you could try something like CDS/ISIS 
(http://portal.unesco.org/ci/en/ev.php-URL_ID=5330&URL_DO=DO_TOPIC&URL_SECTION=201.html)
 -- it's free and developed by UNESCO.  Another one you could try is 
ResourceMate (http://www.resourcemate.com/) -- it's a low-cost Windows-based 
system that I've used before.  This list here might also be useful -- though a 
little dated: http://www.librarysupportstaff.com/4automate.html
 
--TR
 
***
Terry Reese
Cataloger for Networked Resources
Digital Production Unit Head
Oregon State University Libraries
Corvallis, OR  97331
tel: 541-737-6384
email: [EMAIL PROTECTED]
http: http://oregonstate.edu/~reeset
***



From: Code for Libraries on behalf of Jonathan Rochkind
Sent: Wed 1/30/2008 8:54 PM
To: CODE4LIB@listserv.nd.edu
Subject: [CODE4LIB] low-cost software for prison libraries?



Hi all, this is forwarded from a prison librarian listserv.  Does
anyone know of any very low-cost (or open source?) library systems
that would be suitable for small and/or  low-staffed libraries?   I'm
thinking something like Koha or Evergreen would probably be overkill
and/or too hard to install without much/any tech/systems staff, but I
could very well be wrong, I don't know much about either system. I
also don't know much about the needs of that kind of small library.

If anyone does have ideas, could you send them directly to Mary (in
addition to CCing the list if you want, because I'm interested too
and I bet other list members would be).

I've been curious for a while about solutions available to the very
small/limited-resource library in the way of 'automation', but know
almost nothing about it and am not sure if there's an easy way to
find out.  If anyone happens to know something about this (or is
interested in researching it), I personally think the Code4Lib
Journal would be a great place to publish an essay or survey on that
topic.

Jonathan

Begin forwarded message:

 From: [EMAIL PROTECTED]
 Date: January 30, 2008 9:12:19 PM EST
 To: [EMAIL PROTECTED]
 Subject: [prison-l] Library automation software

 Greetings:

 Last month there was some discussion here about cheap/free/
 reasonably priced automation software for correctional libraries.
 I am on a statewide committee which has just been formed to
 research and recommend a software package to replace Athena
 (formerly by Sagebrush, now Follett) in most of the correctional
 libraries in Virginia.  After years in public libraries I am very
 familiar with some of the big vendors, but they are simply
 financially out of the question for our agency, not to mention
 web-based.

 I have looked at the websites for LibraryThing, Auto Librarian, and
 ResourceMate, which were recommended here in the previous
 discussion.  If you know of or have a circ/cat system that is
 reasonably priced (or dirt cheap) and works well for you, please
 share the information with me, with pros and cons if you like.  All
 replies greatly appreciated, and thanks in advance.


 Mary Geist, librarian
 Dept. of Correctional Education
 Brunswick Correctional Center
 1147 Planter's Road
 Lawrenceville, VA  23868
 434.848.4131, ext. 1146



Re: [CODE4LIB] Library Software Manifesto

2007-11-06 Thread Reese, Terry
Roy, 
 
While your rights are interesting, I find the consumer responsibilities 
actually more important (and always more difficult to see followed).  As 
someone who develops software for wide public consumption (read: not 
developers, but the computer-illiterate in many cases), I find that points 1-3 
are the most difficult for most people.  Invariably, people don't really know 
what they want from an application -- just an idea of a workflow based on how 
something worked (or how they learned it) before.  Likewise, most assume that 
if you just say "x doesn't work," then as the developer you'll be able to 
decode the problem.  Sometimes I can decode the problem as the user (which 
tells me that what I'm doing needs to be more straightforward) -- other times, 
I rely on the user to provide as much information as possible to reproduce the 
problem, which can be like a trip to the dentist.  
 
I think our software vendors are in the same position.  Many have fallen 
asleep in terms of understanding what libraries want today -- but at the same 
time -- librarians have traditionally been, as a user group (I'm painting in 
broad strokes here), a bunch of whiners who really don't know what they want 
to begin with.  Any library software manifesto that includes vendor 
responsibilities needs to equally highlight the responsibilities users have in 
this relationship (which looks like the direction you are going -- just don't 
undersell it).
 
--TR
 
***
Terry Reese
Cataloger for Networked Resources
Digital Production Unit Head
Oregon State University Libraries
Corvallis, OR  97331
tel: 541-737-6384
email: [EMAIL PROTECTED]
http: http://oregonstate.edu/~reeset
***



From: Code for Libraries on behalf of Roy Tennant
Sent: Tue 11/6/2007 10:07 AM
To: CODE4LIB@listserv.nd.edu
Subject: [CODE4LIB] Library Software Manifesto



I have a presentation coming up and I'm considering doing what I'm calling a
Library Software Manifesto. Some of the following may not be completely
understandable on the face of it, and I would be explaining the meaning
during the presentation, but this is what I have so far and I'd be
interested in other ideas this group has or comments on this. Thanks,
Roy

Consumer Rights

- I have a right to use what I buy
- I have a right to the API if I've bought the product
- I have a right to accurate, complete documentation
- I have a right to my data
- I have a right to not have simple things needlessly complicated

Consumer Responsibilities

- I have a responsibility to communicate my needs clearly and specifically
- I have a responsibility to report reproducible bugs in a way as to
facilitate reproducing it
- I have a responsibility to report irreproducible bugs with as much detail
as I can provide
- I have a responsibility to request new features responsibly
- I have a responsibility to view any adjustments to default settings
critically


Re: [CODE4LIB] library find and bibliographic citation export?

2007-09-27 Thread Reese, Terry
COinS are included in the output, but because the current pages are loaded via 
AJAX, the data isn't visible to browser plugins like LibX, Zotero, etc.  0.8.3 
will remove nearly all the AJAX -- and when that happens, the COinS data 
should be visible.
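
For anyone who hasn't looked at what the plugins actually scrape for, a COinS
is just an empty span whose title attribute carries a URL-encoded OpenURL
ContextObject. A hedged sketch of generating one -- the citation values here
are made up for illustration:

    # Emit a COinS span: <span class="Z3988" title="...KEV context object...">.
    # Citation values are invented for the example.
    use strict;
    use warnings;
    use URI::Escape qw(uri_escape_utf8);

    my %ctx = (
        'ctx_ver'     => 'Z39.88-2004',
        'rft_val_fmt' => 'info:ofi/fmt:kev:mtx:book',
        'rft.btitle'  => 'An Example Title',
        'rft.aulast'  => 'Author',
        'rft.isbn'    => '9780000000000',
    );
    my $kev = join '&amp;',
        map { $_ . '=' . uri_escape_utf8($ctx{$_}) } sort keys %ctx;
    print qq(<span class="Z3988" title="$kev"></span>\n);

The catch described above is exactly that a span injected after page load via
AJAX never shows up in the plugin's scan of the delivered HTML.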
 
--TR
 
***
Terry Reese
Cataloger for Networked Resources
Digital Production Unit Head
Oregon State University Libraries
Corvallis, OR  97331
tel: 541-737-6384
email: [EMAIL PROTECTED]
http: http://oregonstate.edu/~reeset
***



From: Code for Libraries on behalf of Karen Coombs
Sent: Thu 9/27/2007 11:31 AM
To: CODE4LIB@listserv.nd.edu
Subject: Re: [CODE4LIB] library find and bibliographic citation export?



I believe that LibraryFind includes COinS, but they aren't working quite
right in the current version. If the COinS were working correctly (which
they are supposed to do in the next version), then Zotero would read them and
allow you to import results. I don't know of anyone who has added a citation
export feature otherwise, though.

Jeremy or Terry, please correct me if I've got my COinS version information
confused.

Karen


On 9/27/07 11:57 AM, Tim Shearer [EMAIL PROTECTED] wrote:

 Hi,

 I'm interested to know if anyone working with LibraryFind has begun work
 to create a tool for bibliographic export to citation management tools
 like refworks, etc.

 Thanks!
 Tim

 +++
 Tim Shearer

 Web Development Coordinator
 The University Library
 University of North Carolina at Chapel Hill
 [EMAIL PROTECTED]
 919-962-1288
 +++

--
Karen A. Coombs
Head of Libraries' Web Services
University of Houston
114 University Libraries
Houston, TX  77204-2000
Phone: (713) 743-3713
Fax: (713) 743-9811
Email: [EMAIL PROTECTED]


Re: [CODE4LIB] Polls open for Code4Lib 2007 T-Shirt design

2007-01-29 Thread Reese, Terry
Per the Rosie the Riveter Memorial (http://www.rosietheriveter.org/faq.htm) 
regarding the image: given that it's a commissioned work by the United States 
War Production Commission, I'd say it's likely to be in the public domain.  
I wouldn't worry about it.

4. Is the Rosie the Riveter image copyrighted?

The image that has become most widely known was commissioned by the United 
States War Production Commission - Co-ordinating Committee for use on a 
recruiting poster in 1943. It was intended to be displayed for only two weeks, 
February 15 through February 28. The artist was J. Howard Miller. It is widely 
held that this image is in the public domain, but we are aware of no official 
documentation to that effect. There are less well-known images, including a 
painting by Norman Rockwell entitled "Rosie the Riveter," that remain under 
copyright.

 

--TR
 
***
Terry Reese
Cataloger for Networked Resources
Digital Production Unit Head
Oregon State University Libraries
Corvallis, OR  97331
tel: 541-737-6384
email: [EMAIL PROTECTED]
http: http://oregonstate.edu/~reeset
***



From: Code for Libraries on behalf of Edward Corrado
Sent: Mon 1/29/2007 5:45 AM
To: CODE4LIB@listserv.nd.edu
Subject: Re: [CODE4LIB] Polls open for Code4Lib 2007 T-Shirt design



I have no idea of the legal status of the photo. I believe the length of
copyright in the USA is 75 years (unless you're Disney, then it
is forever), thus it still may be covered on this side of the pond. I
think we need to clarify this before printing up a bunch of shirts with
this photo.

Edward - who doubts anyone will be chasing after code4lib but still, we
should do things the right way

Rob Styles said the following on 1/29/2007 4:54 AM:
 The photo is an original WWII photo from 1944; it's outside of the 50
 years covered by copyright here in the UK and is in use by several
 different organisations. I believe we don't need any clearance.

 rob


 Rob Styles
 Programme Manager, Data Services, Talis
 tel: +44 (0)870 400 5000
 fax: +44 (0)870 400 5001
 direct: +44 (0)870 400 5004
 mobile: +44 (0)7971 475 257
 msn: [EMAIL PROTECTED]
 irc: irc.freenode.net/mrob,isnick



 -Original Message-
 From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf

 Of

 Roy Tennant
 Sent: 26 January 2007 21:10
 To: CODE4LIB@listserv.nd.edu
 Subject: Re: [CODE4LIB] Polls open for Code4Lib 2007 T-Shirt design

 I hate to be the one to raise this, but it seems like I must since the
 design is leading in the polls: do we have (or can we obtain) the right
 to reproduce that photo?
 Roy


 The very latest from Talis
 read the latest news at www.talis.com/news
 listen to our podcasts www.talis.com/podcasts
 see us at these events www.talis.com/events
 join the discussion here www.talis.com/forums
 join our developer community www.talis.com/tdn
 and read our blogs www.talis.com/blogs


 Any views or personal opinions expressed within this email may not be those 
 of Talis Information Ltd. The content of this email message and any files 
 that may be attached are confidential, and for the usage of the intended 
 recipient only. If you are not the intended recipient, then please return 
 this message to the sender and delete it. Any use of this e-mail by an 
 unauthorised recipient is prohibited.


 Talis Information Ltd is a member of the Talis Group of companies and is 
 registered in England No 3638278 with its registered office at Knights Court, 
 Solihull Parkway, Birmingham Business Park, B37 7YB.


--
Edward M. Corrado
http://www.tcnj.edu/~corrado/
Systems Librarian
The College of New Jersey
403E TCNJ Library
PO Box 7718 Ewing, NJ 08628-0718
Tel: 609.771.3337  Fax: 609.637.5177
Email: [EMAIL PROTECTED]