This is definitely a pretty typical experience, alas.

Despite all of the library community's voiced obsession with doing things 'by the book' according to standards, anyone who has actually tried to work with an actually existing large corpus of MARC data finds that it is all over the place, and very non-compliant in many ways.

One of the most annoying things to deal with is that encoding issue. A USMARC/MARC21 record can actually be in MARC-8 encoding OR in UTF-8, and there is a position in the leader (position 09) to declare which encoding is used. But in actually existing MARC records, it is not uncommon for a record to declare itself as being in one encoding while actually being in the other. This makes MARC records very difficult to deal with, definitely.
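As a minimal sketch of what that declaration looks like in code: with MARC::Record, Leader/09 is 'a' for Unicode and blank for MARC-8, and it can be read like this (the filename is a placeholder; as noted above, the declaration may not match the actual bytes in the record):

```perl
use strict;
use warnings;
use MARC::Batch;

# 'records.mrc' is a hypothetical filename.
my $batch = MARC::Batch->new('USMARC', 'records.mrc');
$batch->strict_off;    # keep going past records with structural problems

while (my $record = $batch->next) {
    # Leader/09 declares the encoding: 'a' = UCS/Unicode, blank = MARC-8
    my $declared = substr($record->leader, 9, 1) eq 'a' ? 'UTF-8' : 'MARC-8';
    print $record->title, " declares $declared\n";
}
```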

Jonathan

Eric Lease Morgan wrote:
On 1/23/09 4:39 AM, "Brown, Alan" <[email protected]> wrote:

Does anybody here know the difference between MARC21 and USMARC?

I am munging sets of MARC bibliographic data from a III catalog with
holdings data from the same. I am using MARC::Batch to read my bib'
data (with both strict and warnings turned off), insert 853 and 863
fields, and writing the data using the as_usmarc method. Therefore, I
think I am creating USMARC files. I can then use marcdump to... dump
the records. It returns 0 errors.
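[The workflow Eric describes — read with strict and warnings off, insert 853/863 holdings fields, write with as_usmarc — might look roughly like this; the filenames and subfield values below are placeholders, not his actual data:]

```perl
use strict;
use warnings;
use MARC::Batch;
use MARC::Field;

# Hypothetical filenames; the 853/863 subfield values are placeholders.
my $batch = MARC::Batch->new('USMARC', 'bibs.mrc');
$batch->strict_off;      # as described: strict off
$batch->warnings_off;    # and warnings off

open my $out, '>', 'munged.mrc' or die $!;
while (my $record = $batch->next) {
    # insert caption/pattern (853) and enumeration (863) holdings fields
    $record->append_fields(
        MARC::Field->new('853', '0', '0', a => 'v.', b => 'no.'),
        MARC::Field->new('863', '4', '0', a => '1',  b => '1'),
    );
    print {$out} $record->as_usmarc;    # serialize back to transmission format
}
close $out;
```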
Eric, This isn't an encoding thing is it? I know that a number of III
catalogues still encode their diacritics using the MARC8 version of
USMARC. We have changed ours to Unicode now, but we did have an issue of
the catalogue outputting unicode records that weren't tagged as such in
the leader and so couldn't be identified as proper MARC21 (current
version of USMARC). III have solved this with their latest release. This
issue had me scratching my head with a lot of my MARC::Record scripts,
but generally they failed quite spectacularly.


Actually, I believe I am suffering from a number of different types of
errors in my MARC data: 1) encoding issues (MARC8 versus UTF-8), 2)
syntactical errors (lack of periods, invalid choices of indicators, etc.),
and 3) incorrect data types (strings entered into fields denoted for integers,
etc.). Just about the only thing I haven't encountered is structural errors
such as an invalid leader, and this doesn't even take into account possible
data entry errors (author is Franklin when Twain was entered).

Yes, I do have an encoding issue. All of my incoming records are in MARC8.
I'm not sure, but I think the Primo tool expects UTF-8. I can easily update
the encoding bit (change leader position 09 from blank to a), but this does
not change any actual encoding in the bibliographic section of my data.
Consequently, after updating the encoding bit and looping through my munged
data, MARC::Record chokes on records where UTF-8 is declared but MARC8
characters are included, with the following error:

  utf8 "\xE8" does not map to Unicode at
  /usr/lib/perl5/5.8.8/i686-linux/Encode.pm line 166.

Upon looking at the raw MARC, I see that the offending record includes the word
Münich. What can I do to transform MARC8 data into UTF-8? What can I do to
trap the error above, and skip these invalid records?
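[One plausible approach to both questions, sketched below: MARC::Charset's marc8_to_utf8 converts the field data, Leader/09 is set to 'a' only after the bytes are actually converted, and the read is wrapped in eval so one bad record does not kill the run. Filenames are hypothetical, and this assumes every incoming record really is MARC-8:]

```perl
use strict;
use warnings;
use MARC::Batch;
use MARC::Field;
use MARC::Charset qw(marc8_to_utf8);

my $batch = MARC::Batch->new('USMARC', 'marc8.mrc');
$batch->strict_off;
$batch->warnings_off;

open my $out, '>', 'utf8.mrc' or die $!;
RECORD: while (1) {
    # trap per-record failures (e.g. the Encode.pm error above) and move on
    my $record = eval { $batch->next };
    if ($@) { warn "skipping bad record: $@"; next RECORD }
    last RECORD unless $record;

    for my $field ($record->fields) {
        next if $field->is_control_field;    # 00X fields carry no diacritics
        my @subfields;
        push @subfields, $_->[0], marc8_to_utf8($_->[1])
            for $field->subfields;
        # swap in a rebuilt field whose subfield data is now UTF-8
        $field->replace_with(MARC::Field->new(
            $field->tag, $field->indicator(1), $field->indicator(2),
            @subfields));
    }

    # only now is it honest to flip the encoding bit
    my $leader = $record->leader;
    substr($leader, 9, 1) = 'a';    # declare the record as Unicode
    $record->leader($leader);
    print {$out} $record->as_usmarc;
}
close $out;
```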


--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886 rochkind (at) jhu.edu
