On Jan 23, 2009, at 5:52 AM, Eric Lease Morgan wrote:
On 1/23/09 4:39 AM, "Brown, Alan" <[email protected]> wrote:
Does anybody here know the difference between MARC21 and USMARC?
I am munging sets of MARC bibliographic data from a III catalog with
holdings data from the same. I am using MARC::Batch to read my bib'
data (with both strict and warnings turned off), insert 853 and 863
fields, and write the data using the as_usmarc method. Therefore, I
think I am creating USMARC files. I can then use marcdump to... dump
the records. It returns 0 errors.
Eric, this isn't an encoding thing, is it? I know that a number of III
catalogues still encode their diacritics using the MARC8 version of
USMARC. We have changed ours to Unicode now, but we did have an issue
of the catalogue outputting Unicode records that weren't tagged as
such in the leader and so couldn't be identified as proper MARC21
(current version of USMARC). III have solved this with their latest
release. This issue had me scratching my head with a lot of my
MARC::Record scripts, but generally they failed quite spectacularly.
Actually, I believe I am suffering from a number of different types of
errors in my MARC data: 1) encoding issues (MARC8 versus UTF-8), 2)
syntactical errors (lack of periods, invalid choices of indicators,
etc.), 3) incorrect data types (strings entered into fields denoted
for integers, etc.). Just about the only thing I haven't encountered
is structural errors such as an invalid leader, and this doesn't even
take into account possible data entry errors (author is Franklin when
Twain was entered).
Yes, I do have an encoding issue. All of my incoming records are in
MARC8. I'm not sure, but I think the Primo tool expects UTF-8. I can
easily update the encoding bit (change leader position 09 from blank
to a), but this does not change any actual encoding in the
bibliographic section of my data. Consequently, after updating the
encoding bit and looping through my munged data, MARC::Record chokes
on records that are flagged as UTF-8 but still contain MARC8
characters, with the following error:
utf8 "\xE8" does not map to Unicode at
/usr/lib/perl5/5.8.8/i686-linux/Encode.pm line 166.
Upon looking at the raw MARC, I see that the offending record includes
the word Münich. What can I do to transform MARC8 data into UTF-8?
What can I do to trap the error above, and skip these invalid records?
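One way to trap this kind of failure and skip the bad records is to
attempt a strict UTF-8 decode of each raw record and set aside those
that fail. The sketch below is in Python rather than the Perl used in
this thread (in Perl the same pattern would be an eval block around
the failing call). The function names are hypothetical, and the sample
records are simplified stand-ins (a 24-byte leader followed by text),
not complete MARC records. It relies on MARC8 storing the umlaut as a
combining byte 0xE8 before the base letter, which is not valid UTF-8.

```python
def is_mislabelled(raw):
    """Return True if leader position 09 claims UTF-8 ('a') but the
    record bytes are not valid UTF-8 (e.g. leftover MARC8 diacritics)."""
    if raw[9:10] != b"a":
        return False  # not flagged as UTF-8, so nothing to contradict
    try:
        raw.decode("utf-8")  # strict decoding raises on invalid bytes
    except UnicodeDecodeError:
        return True
    return False


def partition(records):
    """Split raw records into (good, skipped) lists."""
    good, skipped = [], []
    for raw in records:
        (skipped if is_mislabelled(raw) else good).append(raw)
    return good, skipped


# Simplified sample data: a leader with 'a' in position 09 plus a word.
LEADER = b"00000nam a2200000 a 4500"
marc8_record = LEADER + b"M\xe8unich"               # MARC8 bytes, wrong flag
utf8_record = LEADER + "Münich".encode("utf-8")     # genuinely UTF-8

good, skipped = partition([marc8_record, utf8_record])
print(len(good), len(skipped))
```

Skipped records can then be logged or passed to a MARC8-to-UTF-8
converter instead of aborting the whole run.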
We've had good luck with the yaz-marcdump utility that's included with
the YAZ toolkit. We're using it to convert our exported Horizon
records from MARC8 to UTF-8 before we import into AquaBrowser. The
tool is easy to compile, blindingly fast, forgiving of common MARC
errors, and changes the coding correctly. It's been serving us well.
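For illustration, a conversion of this kind could be driven from a
script roughly as follows. This is only a sketch: it assumes a
yaz-marcdump on the PATH that accepts the -f, -t, -o and -l options
described in its manual page, and the function names and file names
are hypothetical, not our actual setup.

```python
import subprocess


def yaz_convert_command(src, yaz="yaz-marcdump"):
    """Build the argument list for a MARC-8 to UTF-8 conversion."""
    return [
        yaz,
        "-f", "MARC-8",  # encoding of the incoming records
        "-t", "UTF-8",   # encoding to write
        "-o", "marc",    # write binary MARC rather than the line format
        "-l", "9=97",    # set leader position 09 to 'a' (character code 97)
        src,
    ]


def convert(src, dest):
    """Run the conversion and write the converted records to dest.
    Raises CalledProcessError if yaz-marcdump exits with an error."""
    with open(dest, "wb") as out:
        subprocess.run(yaz_convert_command(src), stdout=out, check=True)
```

Calling convert("export.mrc", "export.utf8.mrc") would then both
re-encode the data and update the leader flag in a single pass, so the
two never disagree.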
-Tod
Tod Olson <[email protected]>
Systems Librarian
University of Chicago Library