Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Reese, Terry
Actually, you can have records that are MARC21 coming out of vendor databases 
(who sometimes embed control characters into the leader) and still be valid.  
Once you stop looking at just your ILS or OCLC, you probably wouldn't be 
surprised to know that records start looking very different.

--TR



Terry Reese, Associate Professor
Gray Family Chair 
for Innovative Library Services
121 Valley Libraries
Corvallis, Or 97331
tel: 541.737.6384




 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Jonathan Rochkind
 Sent: Wednesday, April 06, 2011 9:44 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] MARC magic for file
 
 Can't you have a legal MARC file that does NOT have 4500 in those
 leader positions?  It's just not legal Marc21, right?   Other marc
 formats may specify or even allow flexibility in the things these bytes
 specify:
 
 * Length of the length-of-field portion
 * Number of characters in the starting-character-position portion of a
 Directory entry
 * Number of characters in the implementation-defined portion of a Directory
 entry
 
 Or, um, 23, which I guess is left to the specific Marc implementation (i.e.,
 Marc21 is one such) to use for its own purposes.
 
 I have no idea how that should inform the 'marc magic'.
 
 Is mime-type application/marc defined as specifically Marc21, or as any
 Marc?
 
 Jonathan
 
 On 4/6/2011 12:28 PM, Ford, Kevin wrote:
 Well, this brings us right up against the issue of files that adhere to their
 specifications versus forgiving applications.  Think of browsers and HTML.
 Suffice it to say, MARC applications are quite likely to be forgiving of leader
 positions 20-23.  In my non-conforming MARC file and in Bill's, the leader
 positions 20-21 ("45") seemed constant, but things could fall apart for
 positions 22-23.  So...
 
 I present the following (in-line and attached, to preserve tabs) in an
 attempt to straddle the two sides of this issue: applications forgiving of
 non-conforming files.  Should the two characters following "45" (at position 20)
 *not* be "00", then the identification will be noted as non-conforming.  We
 could classify this as reasonable identification but hardly ironclad (indeed,
 simply checking to confirm that part of the first 24 positions match the
 specification hardly constitutes a robust identification, but it's something).
 
 It will also give you a mimetype now.

 Would anyone like to test it out more fully on their own files?
 
 
  #
  # MARC 21 Magic  (Third cut)
 
 # Set at position 0
 0	byte	x

 # leader position 20-21 must be 45
 >20	string	45

 # leader starts with 5 digits, followed by codes specific to MARC format
 >>0	regex/1	(^[0-9]{5})[acdnp][^bhlnqsu-z]	MARC Bibliographic
 !:mime	application/marc
 >>0	regex/1	(^[0-9]{5})[acdnosx][z]	MARC Authority
 !:mime	application/marc
 >>0	regex/1	(^[0-9]{5})[cdn][uvxy]	MARC Holdings
 !:mime	application/marc
 >>0	regex/1	(^[0-9]{5})[acdn][w]	MARC Classification
 !:mime	application/marc
 >>0	regex/1	(^[0-9]{5})[cdn][q]	MARC Community
 !:mime	application/marc

 # leader position 22-23, should be 00 but is it?
 >>0	regex/1	(^.{22})([^0]{2})	(non-conforming)
 !:mime	application/marc
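
 To try it, point file(1) at the magic with -m; the file names below are
 only examples:

   $ file -m marc.magic records.mrc
   records.mrc: MARC Bibliographic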
 
 
  If this works, I'll see about submitting this copy.  Thanks to all your 
  efforts
 already.
 
  Warmly,
 
  Kevin
 
  --
  Library of Congress
  Network Development and MARC Standards Office
 
 
 
 
 
  
 From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Simon
 Spero [s...@unc.edu]
  Sent: Sunday, April 03, 2011 14:01
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] MARC magic for file
 
  I am pretty sure that the marc4j standard reader ignores them; the
  tolerant reader definitely does. Otherwise JHU might have about two
  parseable records based on the mangled leaders that J-Rock gets stuck
  with :-)
 
  An analysis of the ~7M LC bib records from the scriblio.net data files
  (~ Dec 2006) indicated that the leader has less than 8 bits of
  information in it (Shannon-Weaver definition). This excludes the
  initial length value, which is redundant given the end-of-record marker.
 
 
  The LC V'GER adds a pseudo tag 000 to its HTML view of the MARC leader.
The final characters of the leader are 450.
 
  Also, I object to the phrase "decent MARC tool."  Any tool capable of
  dealing with MARC as it exists cannot afford the luxury of decency :-)
 
  [ HA: A clear conscience?
BW: Yes, Sir Humphrey.
HA: When did you acquire this taste for luxuries?]
 
  Simon
 
  On Fri, Apr 1, 2011 at 5:16 AM, Owen Stephens o...@ostephens.com wrote:
 
  I'm sure any decent MARC tool can deal with them, since decent MARC
  tools are certainly going to be forgiving enough to deal with four
  characters that apparently don't even really matter.

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Jonathan Rochkind
I'm not sure what you mean, Terry.  Maybe we have different 
understandings of "valid."


If leader bytes 20-23 are not "4500", I suggest that is _by definition_ 
not a valid Marc21 file. It violates the Marc21 specification.


Now, they may still be _usable_, by software that ignores these bytes 
anyway or works around them. We definitely have a lot of software that 
does that.


Which can end up causing problems that remind me of very analogous 
problems caused by the early days of web browsers that felt like being 
'tolerant' of bad data. My html works in every web browser BUT this one, 
why not? Oh, because that's the only one that actually followed the 
standard, oops.


I actually ran into an example of that problem with this exact issue. 
MOST software just ignores marc leader bytes 20-23, and assumes the 
semantics of 4500---the only legal semantics for Marc21.  But Marc4j 
actually _respected_ them, apparently the author thought that some marc 
in the wild might intentionally set different bytes here (no idea if 
that's true or not). So if the leader bytes 20-23 were invalid 
(according to the spec), Marc4j would suddenly decide that the length 
of field portion was NOT 4, but actually BELIEVE whatever was in leader 
byte 20, causing the record to be parsed improperly.  And I had records 
like that coming out of my ILS (not even a vendor database). That was an 
unfun couple days of debugging to figure out what was going on.
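
To make the failure mode concrete, here is a rough sketch (Perl, with a 
hypothetical raw record in $raw) of what a literal reader does with those 
bytes:

    # how a literal reader derives directory-entry width from the leader
    my $f_len = substr($raw, 20, 1);        # length-of-field count; '4' in valid Marc21
    my $s_len = substr($raw, 21, 1);        # starting-position count; '5' in valid Marc21
    my $entry_width = 3 + $f_len + $s_len;  # 3-byte tag plus the two counts
    # A forgiving reader hardcodes $entry_width = 12; a literal one trusts
    # the bytes, so a control character in leader byte 20 skews every
    # directory entry that follows.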


On 4/6/2011 12:52 PM, Reese, Terry wrote:

Actually, you can have records that are MARC21 coming out of vendor databases 
(who sometimes embed control characters into the leader) and still be valid.  
Once you stop looking at just your ILS or OCLC, you probably wouldn't be 
surprised to know that records start looking very different.


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Prettyman, Timothy
Just as a historical note, this non-standard use of LDR/22 is likely due to 
OCLC's use of the character as a hexadecimal flag from back in the days when 
MARC records were mostly schlepped around on tapes.  They referred to it as the 
"Transaction type code."  When records were sent to OCLC for processing, 
various values of the flag indicated whether a catalog card was to be produced, 
whether the record was an update, whether the user location symbol was to be 
set, etc.  I'm sure others have used it for their own nefarious purposes as 
well.

Tim Prettyman
University of Michigan/LIT

On Apr 6, 2011, at 12:28 PM, Ford, Kevin wrote:

 Well, this brings us right up against the issue of files that adhere to their 
 specifications versus forgiving applications.  Think of browsers and HTML.  
 Suffice it to say, MARC applications are quite likely to be forgiving of 
 leader positions 20-23.  In my non-conforming MARC file and in Bill's, the 
 leader positions 20-21 ("45") seemed constant, but things could fall apart 
 for positions 22-23.  So...
 
 
 From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Simon Spero 
 [s...@unc.edu]
 Sent: Sunday, April 03, 2011 14:01
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] MARC magic for file
 
 
 On Fri, Apr 1, 2011 at 5:16 AM, Owen Stephens o...@ostephens.com wrote:
 
 I'm sure any decent MARC tool can deal with them, since decent MARC tools
 are certainly going to be forgiving enough to deal with four characters
 that
 apparently don't even really matter.
 
 You say that, but I'm pretty sure Marc4J throws errors on MARC records where
 these characters are incorrect
 
 Owen
 
 On Fri, Apr 1, 2011 at 3:51 AM, William Denton w...@pobox.com wrote:
 
 On 28 March 2011, Ford, Kevin wrote:
 
 I couldn't get Simon's MARC 21 Magic file to work.  Among other issues, I
 received "line too long" errors.  But, since I've been curious about this
 for some time, I figured I'd take a whack at it myself.  Try this:
 
 
 This is very nice!  Thanks.  I tried it on a bunch of MARC files I have,
 and it recognized almost all of them.  A few it didn't, so I had a closer
 look, and they're invalid.
 
 For example, the Internet Archive's Binghamton catalogue dump:
 
 http://ia600307.us.archive.org/6/items/marc_binghamton_univ/
 
 $ file -m marc.magic bgm*mrc
 bgm_openlib_final_0-5.mrc: data
 

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Reese, Terry
Actually -- I'd disagree, because that is a very narrow view of the 
specification.  When validating MARC, I'd take the approach of validating 
structure (which allows you to then read any MARC format) -- then use a 
separate process for validating content of fields, which, in my opinion, is more 
open to interpretation based on system usage of the data.  For example, 22 and 
23 are undefined values that local systems may very well have a practical need 
to define and use, given that there are only so many values in the leader.  This 
is why I sometimes see additional values in leader position 09 (which should be "a" or 
blank) to define different character set types, or additional elements added to 
other fields.  If I want to validate the content of those fields, I'd validate 
it through a different process -- but I separate the process from the 
validation of the structure -- because the two are separate concerns.
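
A minimal sketch of the separation I mean (Perl; the names and the exact 
checks are hypothetical):

    # Phase 1: structure only -- is it shaped like MARC at all?
    sub looks_structurally_marc {
        my $raw = shift;
        return $raw =~ /^\d{5}/              # record length is five digits
            && substr($raw, -1) eq "\x1D";   # and it ends with the record terminator
    }
    # Phase 2, run separately and optionally: content rules such as leader/09
    # or 20-23, applied only when a given profile actually matters to you.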

--TR

 -Original Message-
 From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
 Sent: Wednesday, April 06, 2011 9:59 AM
 To: Code for Libraries
 Cc: Reese, Terry
 Subject: Re: [CODE4LIB] MARC magic for file
 
  I'm not sure what you mean, Terry.  Maybe we have different understandings
  of "valid."
 
  If leader bytes 20-23 are not "4500", I suggest that is _by definition_ not a
  valid Marc21 file. It violates the Marc21 specification.
 

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread William Denton

On 6 April 2011, Reese, Terry wrote:

Actually -- I'd disagree, because that is a very narrow view of the 
specification.  When validating MARC, I'd take the approach of validating 
structure (which allows you to then read any MARC format) -- then use a 
separate process for validating content of fields, which, in my opinion, 
is more open to interpretation based on system usage of the data.


What do you think is the best way to recognize MARC files (up to some 
level of validity, given all the MARC you've seen and parsed) that could 
be made to work within the way magic is defined?


Bill
--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Reese, Terry
I'm honestly not familiar with magic.  I can tell you that in MarcEdit, the way 
the process works is that there is a very generic function that reads the structure 
of the data, not trusting the information in the leader (since I find this data 
very unreliable).  Then, if users want to apply a set of rules to the 
validation -- I apply those as a secondary process.  If you are looking to 
validate specific content within a record, then what you want to do in this 
function may be appropriate -- though you'll find some local systems will 
consistently fail the process.
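
Roughly, the structural read amounts to trusting the terminators rather than 
the leader -- something like this sketch ($fh is a hypothetical filehandle 
opened in raw mode):

    local $/ = "\x1D";                      # MARC record terminator
    while ( my $raw = <$fh> ) {
        # split on the field terminator instead of trusting directory offsets
        my ( $ldr_and_dir, @fields ) = split /\x1E/, $raw;
        # tags can be recovered from the directory, or fields processed in order
    }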

--tr




[CODE4LIB] Fwd: OAC RFP Announcement

2011-04-06 Thread Robert Sanderson
Forwarded:

The Open Annotation Collaboration (OAC) project is pleased to announce
a Request For Proposal to collaborate with OAC researchers for
building implementations of the OAC data model and ontology. The OAC
is seeking to collaborate with scholars and/or librarians currently
using and/or curating established repositories of scholarly digital
resources with well-defined audiences of scholars. The OAC intends to
fund a set of four projects that are complementary in content media
type and use cases that leverage the OAC Data Model to the fullest
extent, and that leverage existing annotation tools or at least have
articulated an interesting scholarly annotation use case.

Two of the successful Respondents will collaborate with OAC
researchers at the University of Maryland, and the other two will
collaborate with OAC researchers at the University of Illinois at
Urbana-Champaign. (For these collaborations, Illinois and Maryland
will provide guidance on the implementation of the OAC data model and
ontology, help in defining extensions of the data model that might be
necessary, advice on existing tools that might be adaptable for the
demonstration experiment, feedback on correctness of mappings from/to
native annotation formats and/or annotations created.)

The full text of the RFP can be found at
http://www.openannotation.org/documents/openAnnotationRFP.pdf

The IP agreement attachment to this RFP is available at:
http://www.openannotation.org/documents/openAnnotationIP_Agreement_forRFP.pdf

A FAQ about this RFP is available at:
http://www.openannotation.org/RFP_FAQs.html

Please make all submissions regarding this RFP, including your letter
of intent and proposal, to oac2...@support.lis.illinois.edu

Questions regarding any details of this RFP should also be emailed to
oac2...@support.lis.illinois.edu; answers to substantive questions
from individuals will be posted immediately on the RFP FAQ page
mentioned above (so as to be available to all proposers).

The Open Annotation Collaboration is supported by a grant from the
Andrew W. Mellon Foundation. OAC members include the University of
Illinois at Urbana-Champaign, the University of Maryland, the
University of Queensland (Australia), and the Los Alamos National
Laboratory.

Regards,

Jacob Jett
Assistant Coordinator, Open Annotation Collaboration Project
Center for Informatics Research in Science and Scholarship
The Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
501 E. Daniel Street, MC-493, Champaign, IL 61820-6211 USA


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Kyle Banerjee
... Maybe we have different understandings of "valid."

 If leader bytes 20-23 are not "4500", I suggest that is _by definition_ not
 a valid Marc21 file. It violates the Marc21 specification.

 Now, they may still be _usable_, by software that ignores these bytes
 anyway or works around them. We definitely have a lot of software that does
 that.

 Which can end up causing problems that remind me of very analogous problems
 caused by the early days of web browsers that felt like being 'tolerant' of
 bad data. My html works in every web browser BUT this one, why not? Oh,
 because that's the only one that actually followed the standard, oops.


There is some question as to what value there is in validating fields that
have no meaning by definition. What benefit does validating an undefined
value have, other than creating an opportunity to break things and slow the
process down a little? The entire concept of an invalid entry in an
undefined field (e.g., byte 23) is oxymoronic.

I'd go so far as to question the value of validating redundant data that
theoretically has meaning but which is never supposed to vary. The 4 and
the 5 simply repeat what is already known about the structure of the MARC
record. Choking on stuff like this is like having a web browser ask you what
you want to do with a page because it lacks a document type declaration.

Garbage data is the reality, so having parsers stop when they encounter data
they don't actually need unnecessarily complicates things. That kind of
stuff should generate a warning at worst.

kyle


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Jonathan Rochkind

Actually -- I'd disagree, because that is a very narrow view of the
specification.  When validating MARC, I'd take the approach of validating
structure (which allows you to then read any MARC format) -- then use a
separate process for validating content of fields, which, in my opinion,
is more open to interpretation based on system usage of the data.


Wait, so is there any formal specification of validity that you can 
look at to determine your definition of validity, or is it just "well, 
if I can recover it into useful data, using my own algorithms..."?


I think we computer programmers are really better-served by reserving 
the notion of "validity" for things specified by formal specifications 
-- as we normally do when talking about any other data format.  And the 
only formal specifications I can find for Marc21 say that leader bytes 
20-23 should be "4500". (Not true of Marc in general, just Marc21.)


Now it may very well be (is!) true that the library community has been 
in the practice of tolerating "working" Marc that is NOT valid 
according to any specification.  So, sure, we may need to write 
software to take account of that sordid history. But I think it IS a 
sordid history -- not having a specification to ensure validity makes it 
VERY hard to write any new software that recognizes what you expect it 
to recognize, because what you expect it to recognize isn't formally 
specified anywhere. It's a problem.  We shouldn't try to hide the 
problem in our discussions by using the word "valid" to mean something 
different than we use it for any modern data format. "Valid" only has 
meaning when you're talking about valid according to some specific 
specification.


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Jonathan Rochkind

On 4/6/2011 2:02 PM, Kyle Banerjee wrote:

I'd go so far as to question the value of validating redundant data that
theoretically has meaning but which is never supposed to vary. The 4 and
the 5 simply repeat what is already known about the structure of the MARC
record. Choking on stuff like this is like having a web browser ask you what
you want to do with a page because it lacks a document type declaration.


Well, the problem is when the original Marc4J author took the spec at 
its word, and actually _acted upon_ the '4' and the '5', changing file 
semantics if they were different, and throwing an exception if it was a 
non-digit.


This actually happened, I'm not making this up!  Took me a while to debug.

So do you think he got it wrong?  How was he supposed to know he got it 
wrong, when he wrote to the spec and took it at its word? Are you SURE there 
aren't any Marc formats other than Marc21 out there that actually do use 
these bytes with their intended meaning, instead of fixing them? How was 
the Marc4J author supposed to be sure of that, or even guess it might be 
the case, and know he'd be serving users better by ignoring the spec 
here instead of following it?  What documents, instead of the actual 
specifications, should he have been looking at to determine that he ought 
not to be taking those bytes at their word, but just ignoring them?


To realize that we have so much non-conformant data out there that we're 
better off ignoring these bytes, is something you can really only learn 
through experience -- and something you can then later realize you're 
wrong on too:


I.e.: I _thought_ I was writing only for Marc21, but then it turns out 
I've got to accept records from Outer Weirdistan that are a kind of 
legal Marc that actually uses those bytes for their intended meaning -- 
better go back and fix my entire software stack, involving various 
proprietary and open source products from multiple sources, each of 
which has undocumented behavior when it comes to these bytes; maybe they 
follow the spec or maybe they follow Kyle's advice, but they don't tell 
me.  This is a mess.


Maybe this scenario is impossible, maybe there ARE and NEVER HAVE BEEN 
any Marc variants that actually use leader bytes 20-22 in this way -- 
how can I determine that?  I've just got to guess and hope for the 
best.  The point of specifications in the first place is for 
inter-operability, so we know that if all software and data conforms to 
the spec, then all software and data will interact in expected ways.  
Once we start guessing at which parts of the spec we really ought to be 
ignoring...


Again, I realize in the actual environment we've got, this is not a 
luxury we have. But it's a fault, not a benefit, to have lots of 
software everywhere behaving in non-compliant ways and creating invalid 
(according to the spec!) data.


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread William Denton

On 6 April 2011, Jonathan Rochkind wrote:

I think we computer programmers are really better-served by reserving the 
notion of validity for things specified by formal specifications -- as we 
normally do, talking about any other data format.   And the only formal 
specifications I can find for Marc21 say that leader bytes 20-23 should be 
4500. (Not true of Marc in general just Marc21).


Validity does mean something definite ... but Postel's Law is a good 
guideline, especially with the swamp of bad MARC, old MARC, alternate 
MARC, that's out there.  Valid MARC is valid MARC, but if---for the sake 
of file and its magic---we can identify technically invalid but still 
usable MARC, that's good.


Bill
--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Jonathan Rochkind

On 4/6/2011 2:43 PM, William Denton wrote:


Validity does mean something definite ... but Postel's Law is a good
guideline, especially with the swamp of bad MARC, old MARC, alternate
MARC, that's out there.  Valid MARC is valid MARC, but if---for the sake
of file and its magic---we can identify technically invalid but still
usable MARC, that's good.


Hmm, except in the case of Web Browsers, I think the general consensus is 
that Postel's law was not helpful. These days, most people seem to think that 
having different browsers be tolerant of invalid data in different ways 
was actually harmful rather than helpful to inter-operability (which is 
theoretically the goal of Postel's law), and that's not what people do 
anymore in web browser land, at least not to the extremes they used to 
do it.


So Postel's Law may not be a universal.  Although marc data may or may 
not be analogous to a web browser/html. :)  It doesn't _really_ matter, 
cause we're stuck with the legacy we're stuck with, there's no changing 
it now. But there are real world negative consequences to it, some of 
which I've tried to explain in previous messages. (And still, don't call 
it validity if it's not, please! But yes, sometimes insisting on strict 
validity is not the appropriate solution.)


Also note that assuming that bytes 20-21 are "45" even when they're something 
else is possibly not something Postel would accept as an application of 
his law -- unless you document your software specifically as working 
only with Marc21, and not any Marc.


[Postel's Law: "Be conservative in what you send; be liberal in what you 
accept." http://en.wikipedia.org/wiki/Robustness_principle .  That wiki 
page also notes the general category of downside in following Postel's 
law, which is what was encountered with HTML, and which _I've_ 
encountered with MARC:  "For example, a defective implementation that 
sends non-conforming messages might be used only with implementations 
that tolerate those deviations from the specification until, possibly 
several years later, it is connected with a less tolerant application 
that rejects its messages. In such a situation, identifying the problem 
is often difficult, and deploying a solution can be costly." 


Yes, identifying the problem and deploying the solution was costly, in 
my MARC case, although it definitely could have been worse. ]


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Kyle Banerjee
 Well, the problem is when the original Marc4J author took the spec at its
 word, and actually _acted upon_ the '4' and the '5', changing file semantics
 if they were different, and throwing an exception if it was a non-digit.

At least the author actually used the values rather than checking to see if
a 4 or 5 were there. I still don't see what the point of looking for a 0 in
an undefined field would be. I'm wondering what kind of nut job would write
this into the standard, but that's not the author's problem.


 Do you think he got it wrong?  How was he supposed to know he got it wrong,
 when he wrote to the spec and took it at its word? Are you SURE there aren't any
 Marc formats other than Marc21 out there that actually do use these bytes
 with their intended meaning, instead of fixing them?


I wouldn't call it wrong -- the spec is a logical point of departure. MARC21
derives from an ISO standard that does not use those character positions and
which otherwise requires the same data layout, but the author wouldn't
necessarily know that.

Standards have something in common with laws in that how they are used in
the real world is as or more important than what is actually defined --
what's written and what's done in practice can be very different.

Everyone here who has parsed catalog data or done an ILS migration
knows better than to think for a second that fields can be assumed to
be used as defined, except for very basic stuff.


 How was the Marc4J author supposed to be sure of that, or even guess it
 might be the case, and know he'd be serving users better by ignoring the
 spec here instead of following it?


There might not have been a good way to know. With data, one thing you
always want to do is ask a bunch of people who work with it all the time
about anomalies in the wild. Many great works of fiction masquerade as
documents which supposedly describe reality.


 I.e.: I _thought_ I was writing only for Marc21, but then it turns out I've
 got to accept records from Outer Weirdistan that are a kind of legal Marc
 that actually uses those bytes for their intended meaning


Any such MARC would be noncompliant with the ISO standard from which
MARC21 hails. If working from the MARC21 standard and weird records are in
question, there would be a greater chance of choking on nonnumeric tags, as
those are allowed by the ISO standard.

Ignoring that MARC21 would need to be redefined to be able to take on other
values, one can safely conclude that such a redefinition could only be
written by totally deranged individuals. Values lower than 4 and 5
respectively would limit record length to the point that little or no data
could be stored (a 3-digit length-of-field, for example, caps a field at 999
octets), and greater values would be completely nonsensical, as the MARC
record length limitation (five digits, so 99999 octets at most) would mean
that the extra space allocated by the digits could only contain zeros.

In any case, MARC is a legacy standard from the '60s. The chances of new
flavors emerging are dismal at best.


 Again, I realize in the actual environment we've got, this is not a luxury
 we have. But it's a fault, not a benefit, to have lots of software
 everywhere behaving in non-compliant ways and creating invalid (according to
 the spec!) data.

Creating is another matter entirely. Since we can control what we create
ourselves, we make things a little better every time we make them
conformant. However, we can't control what others do, and being able to read
everything is useful, including stuff created using tools/processes that
aren't up to scratch.

kyle


[CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Eric Lease Morgan
Ack! While using the venerable Perl MARC::Batch module I get the following 
error while trying to read a MARC record:

  utf8 "\xC2" does not map to Unicode

This is a real pain, and I'm hoping someone here can help me either: 1) trap 
this error allowing me to move on, or 2) figure out how to open the file 
correctly.
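
The kind of trap I have in mind for option 1 looks roughly like this 
(untested, and it assumes the stream eventually gets past the bad spot):

    use strict;
    use warnings;
    use MARC::Batch;

    my $batch = MARC::Batch->new( 'USMARC', 'tor.marc' );
    $batch->strict_off;               # downgrade MARC::Record errors to warnings
    my $errors = 0;
    while (1) {
        my $record = eval { $batch->next };
        if ($@) {                     # trap the fatal error and move on
            warn "skipping a record: $@";
            last if ++$errors > 100;  # bail out if the stream is hopeless
            next;
        }
        last unless $record;          # undef means end of file
        # ... do something with $record ...
    }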

-- 
Eric Morgan


Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Jonathan Rochkind
I am not familiar with that Perl module, but I'm more familiar than I'd 
want to be with char encoding in Marc.


I don't recognize the byte 0xC2 (there are some bytes I became 
pathetically familiar with in past debugging, but I've forgotten 'em), 
but the first things to look at:


1. Is your Marc file encoded in Marc8 or UTF-8?  I'm betting Marc8. 
Theoretically there is a Marc leader byte that tells you whether it's 
Marc8 or UTF-8, but the leader byte is often wrong in real world 
records.  Is it wrong?
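
(Checking is cheap once you have the raw record in hand -- leader byte 09 is 
blank for Marc8 and 'a' for UTF-8; $raw_record here is hypothetical:)

    my $flag = substr( $raw_record, 9, 1 );
    print $flag eq 'a' ? "claims UTF-8\n" : "claims Marc8\n";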


2. Does Perl MARC::Batch have a function to convert from Marc8 to 
UTF-8?  If so, how does it decide whether to convert? Is it trying to 
do that?  Is it assuming that the leader byte in the record accurately 
identifies the encoding, and if so, is the leader byte wrong?  Is it 
trying to convert from Marc8 to UTF-8 when the source was UTF-8 in the 
first place?  Or is it assuming the source was UTF-8 in the first place, 
when in fact it was Marc8?


Not the answer you wanted, maybe someone else will have that. Debugging 
char encoding is hands down the most annoying kind of debugging I ever do.


On 4/6/2011 4:13 PM, Eric Lease Morgan wrote:

Ack! While using the venerable Perl MARC::Batch module I get the following 
error while trying to read a MARC record:

   utf8 \xC2 does not map to Unicode

This is a real pain, and I'm hoping someone here can help me either: 1) trap this error 
allowing me to move on, or 2) figure out how to open the file correctly.



Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread LeVan,Ralph
Can you share the record somewhere?  I suspect many of us have tools we
can turn loose on it.

Ralph



[CODE4LIB] **SKOS-2-HIVE: CREATING SKOS VOCABULARIES TO HELP INTERDISCIPLINARY VOCABULARY ENGINEERING**

2011-04-06 Thread Kevin S. Clarke
Forwarding because I think this will be of interest to some folks on the
list...

-- Forwarded message --



***SKOS-2-HIVE: CREATING SKOS VOCABULARIES TO HELP INTERDISCIPLINARY
VOCABULARY ENGINEERING***



We are pleased to announce the addition of more HIVE workshops!


*DATES AND LOCATIONS*

*April 29, 2011
University of North Texas, Denton, Texas
Registration Deadline: April 20th*

Click here to register for the Texas workshop: http://tinyurl.com/4f39ye6

*May 20, 2011
Columbia University, New York City
Registration Deadline: May 10th*

Click here to register for the New York workshop: http://tinyurl.com/4fdode9

*A California-based workshop date is to be determined!*

*If your institution is interested in hosting a workshop, please contact:*
hive.workshop2...@gmail.com

*WORKSHOP DESCRIPTION*

*SKOS-2-HIVE* workshops focus on using semantic web technologies for
representing and describing collections using multiple controlled
vocabularies. The workshop focuses on basic understanding and usage of W3C's
Simple Knowledge Organization Systems (SKOS, http://www.w3.org/2004/02/skos/),
linked data, and the HIVE library of open source applications.

There are two workshop components:

1. *Foundational Concepts and HIVE Basics*. This component addresses the
conceptual design of structured vocabularies, including a range of semantic
relationships; domain representation and issues central to identifying
useful vocabularies; the application of basic SKOS tags; and basic
techniques underlying the HIVE vocabulary server for enriching
digital resource descriptions.

2. *Implementing HIVE*. This component covers more technical aspects,
including steps for implementing a HIVE server.

Workshop outlines and learning outcomes are provided further below.

Workshop rationale: Semantic web technologies provide innovative means for
organizing, describing, and managing digital resources in a range of
formats. Successful implementation and use of semantic web technologies
requires both information professionals and system developers to become
knowledgeable about the underlying intellectual construct and roadmap toward
forming a semantic web. The IMLS-funded Helping Interdisciplinary Vocabulary
Engineering (HIVE) project (https://www.nescent.org/sites/hive/Main_Page)
has been addressing these needs by working with the W3C's Simple Knowledge
Organization Systems (SKOS, http://www.w3.org/TR/skos-reference/) in the
linked data environment. HIVE has been implemented using semantic web
enabling technologies and machine learning to provide a solution to the
traditional controlled vocabulary problems of cost, interoperability, and
usability. Current HIVE vocabulary partners include the Library of
Congress (http://www.loc.gov/index.html), the Getty Research Institute
(http://www.getty.edu/research/), and the U.S. Geological Survey
(http://www.usgs.gov/).

*WORKSHOP OUTLINE AND LEARNING OUTCOMES*

*Morning Session: Foundational Concepts and HIVE Basics, 8:30 AM-12:00 PM*

*Overview*

This session addresses traditional thesaural concepts and the extension of
these concepts via SKOS/linked data, HIVE and the semantic web.

*Audience*

This workshop targets information professionals (librarians, archivists,
museum professionals, web architects, and others); system developers; and
students seeking knowledge about the basic framework and conceptual aspects
of vocabulary design.

*Prerequisites*

Have a basic understanding of subject metadata creation or subject
cataloging.

*Learning Outcomes*

- Evaluate controlled vocabulary, thesauri, and ontologies that would best
fit your information environment's needs.

- Identify basic thesaural relationships, including relative, associative,
and hierarchical.

- Use basic SKOS tags to identify the above thesaural relationships.

- Become familiar with using the HIVE software and the HIVE processes.

*Lunch on your own, 12:00 PM-1:00 PM*

*Afternoon Session: Implementing HIVE, 1:00 PM-4:30 PM*

*Overview*

This session provides details on the HIVE system, underlying algorithms,
source code, and the library of system features.

*Audience*

System developers, as well as technologists, librarians, and information
scientists who are interested in the technological side of the semantic web,
and who may be implementing, experimenting with, and/or extending HIVE
technologies.

*Prerequisites*

Java programming, and object oriented design.

*Learning Outcomes*
- Understand the architecture of the HIVE vocabulary server.
- Become familiar with information retrieval techniques and how HIVE applies
them to vocabulary terms.
- Gain experience indexing documents with HIVE and KEA (a machine learning
application).
- Learn how to integrate HIVE vocabulary services into other tools.
- Learn how to use the SPARQL language for querying content in HIVE.

Click here to register for the Texas workshop: http://tinyurl.com/4f39ye6

Click here to register for the New York workshop: http://tinyurl.com/4fdode9

*Registration Fees

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Eric Lease Morgan
On Apr 6, 2011, at 4:46 PM, LeVan,Ralph wrote:

 Ack! While using the venerable Perl MARC::Batch module I get the
 following error while trying to read a MARC record:
 
   utf8 \xC2 does not map to Unicode
 
 Can you share the record somewhere?  I suspect many of us have tools we
 can turn loose on it.

Sure, thanks. Try:

  http://zoia.library.nd.edu/tmp/tor.marc

-- 
Eric Lease Morgan


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Mike Taylor
On 6 April 2011 19:53, Jonathan Rochkind rochk...@jhu.edu wrote:
 Hmm, except in the case of Web Browsers, I think the general consensus is
 that Postel's law was not helpful. These days, most people seem to think that
 having different browsers be tolerant of invalid data in different ways was
 actually harmful rather than helpful to inter-operability (which is
 theoretically the goal of Postel's law), and that's not what people do
 anymore in web browser land, at least not to the extremes they used to do
 it.

But the idea that browsers should be less permissive in what they
accept is a modern one that we now have the luxury of only because
adherence to Postel's law in the early days of the Web allowed it to
become ubiquitous.  Though it's true, as Harvey Thompson has observed,
that it's difficult to retro-fit correctness, Clay Shirky was also
very right when he pointed out that "you cannot simultaneously have
mass adoption and rigor."  If browsers in 1995 had been as pedantic as
the browsers of 2011 (rightly) are, we wouldn't even have the Web; or
if it existed at all it would just be a nichey thing that a few
scientists used to make their publications available to each other.

So while I agree that in the case of HTML we are right to now be
moving towards more rigorous demands of what to accept (as well, of
course, as being conservative in what we emit), I don't think we could
have made the leap from nothing to modern rigour.

-- Mike


Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Reese, Terry
I'd echo Jonathan's question -- the 0xC2 code is the sound recording marker in 
MARC-8.  I'd guess the file isn't in UTF-8.

--TR


Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread LeVan,Ralph
Lol!

So right off the bat I see that the leader says the record is 1091 bytes
long, but it is actually 1089 bytes long and I end up missing the leader
for the next record.  Maybe a CR/LF problem?  I see that frequently as a
way to mangle MARC records when moving them around.
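
(The check is mechanical -- compare the five declared digits against where
the record terminator actually falls; $data here is just the raw file
slurped in:)

    my $declared = 0 + substr( $data, 0, 5 );   # leader/0-4: record length
    my $actual   = index( $data, "\x1D" ) + 1;  # 0x1D is the record terminator
    printf "declared %d, actual %d\n", $declared, $actual;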

Is your problem in the very first record?

Ralph


Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Jonathan Rochkind
That's hilarious, that Terry has had to do enough ugliness with Marc 
encodings that he indeed can recognize 0xC2 off the bat as the Marc8 
encoding it represents!  I am in awe, as well as sympathy.


If the record is in Marc8, then you need to know if Perl MARC::Batch can 
handle Marc8.  If it's supposed to be able to handle it, you need to 
figure out why it's not. (Leader byte says UTF-8 even though it's really 
Marc8?)


If MARC::Batch can't handle Marc8, you need to convert to UTF-8 first. 
The only software package I know of that can convert from and to Marc8 
encoding is Java Marc4J, but I wouldn't be shocked if there was 
something in Perl to do it. (But yes, as you can tell by the name, 
Marc8 is a character encoding ONLY used in Marc; nobody but library 
people write software for dealing with it.)
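
(In fact I believe MARC::Charset is the Perl module for this; roughly:)

    use MARC::Charset qw( marc8_to_utf8 );

    # convert a MARC-8 byte string to UTF-8; error handling omitted
    my $utf8 = marc8_to_utf8( $marc8_bytes );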


On 4/6/2011 5:01 PM, Reese, Terry wrote:

I'd echo Jonathan's question -- the 0xC2 code is the sound recording marker in 
MARC-8.  I'd guess the file isn't in UTF-8.

--TR




Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Jon Gorman
I'm not quite convinced that it's MARC-8 just because there's \xC2 ;).
If you look at a hex dump, I'm seeing a lot of what might be combining
characters.  The leader appears to have 'a' in the position that indicates
Unicode.  In the raw hex I'm seeing a lot of two-character sequences
like: 756c 69c3 83c2 a872 (culir).  If I knew my UTF-8 better, I
could guess what combining diacritics these are.  Doing a lookup on
http://www.fileformat.info seems to indicate that this might be UTF-8,
a 'DIAERESIS'.
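
If those pairs really are double-encoded UTF-8, a round trip like this would
recover the intended character (a sketch, using the bytes from the dump):

    use Encode qw( decode encode );

    my $bytes = "\xc3\x83\xc2\xa8";                 # the suspicious pair of pairs
    my $once  = decode( 'UTF-8', $bytes );          # 'A-tilde' + 'diaeresis' mojibake
    my $fixed = decode( 'UTF-8', encode( 'latin1', $once ) );   # e with grave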

When debugging any encoding issue it's always good to know:

a) how the records were obtained;
b) how they have been manipulated before you touched them (basically, how
many times they may have been converted by some bungling process);
c) what encoding they claim to be now;
and
d) what encoding they actually are, if any.


It's been a while since I used MARC::Batch.  Is there any reason
you're using that instead of just using MARC::Record?  I'd try just
creating a MARC::Record object.

I've seen people do really bizarre things to break MARC files, such as
editing the raw binary (thus invalidating the leader and the directory,
as the byte counts were no longer right).

I hate to say it, but we still come across files that are no longer in
any encoding due to too many bad conversions.  It's possible these are
as well.

The enca tool (I haven't used it much) guesses this as utf-8 mixed with
non-text data.

Jon


Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread William Denton

On 6 April 2011, Eric Lease Morgan wrote:


 http://zoia.library.nd.edu/tmp/tor.marc


Happily, Kevin's magic formula recognizes this as MARC!

Bill
--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org