Rebecca S Guenther
Tue, 17 Mar 2009 12:53:18 -0700
I ran this by a colleague here who has done a lot of these transformations, and he said the following:
From Morgan Cundiff:
She says "the MARC -> MARCXML program does not output Unicode Normal Form
C". My first question would be "What program is that?" There are quite a few
that do this.
Whatever it is, she is probably right. I used Marc Report. I then used the Perl
script provided by OCLC to convert the MARC slim file from Normalization Form D
(decomposed) to Normalization Form C (composed).
My understanding is that a small number of the decomposed combinations used in
MARC records have no Form C equivalent, so those stay decomposed.
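The point about combinations with no Form C equivalent can be illustrated with a minimal Python sketch. It uses only the standard `unicodedata` module and is not the OCLC Perl script mentioned above. The helper name `to_nfc` and both sample strings are assumptions chosen for illustration: `e` plus U+0301 has a precomposed form, while U+0361 (combining double inverted breve, the kind of ligature mark seen in romanized text) has none.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


# "e" + COMBINING ACUTE ACCENT: a precomposed character exists (U+00E9).
composable = "e\u0301"

# "t" + COMBINING DOUBLE INVERTED BREVE + "s": no precomposed character
# exists, so NFC has nothing to compose it into.
no_precomposed = "t\u0361s"

composed = to_nfc(composable)
unchanged = to_nfc(no_precomposed)

print(len(composable), len(composed))          # 2 1
print(unchanged == no_precomposed)             # True
```

A string like the second one is still valid Form C after normalization, even though it continues to contain a combining mark.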
Morgan
Rebecca S. Guenther
Senior Networking and Standards Specialist
Network Development and MARC Standards Office
Library of Congress
101 Independence Ave. SE
Washington, DC 20540-4402
(202) 707-5092 (voice) (202) 707-0115 (FAX)
r...@loc.gov
>>> DC-RDA automatic digest system <lists...@jiscmail.ac.uk> 3/16/2009 8:05 PM >>>
Date: Mon, 16 Mar 2009 10:59:24 -0700
From: Karen Coyle <kco...@kcoyle.net>
Subject: MARC and Unicode normalization forms
Alistair had a large number of error messages about character set
problems when he processed records from MARC through various steps into
RDF:
WARN [main] (RDFDefaultErrorHandler.java:36) -
file:data/mods/part01-split16.mods.xml.rdf(line 249403 column 117): {W131}
String not in Unicode Normal Form C: "Musée bibliographique"
WARN [main] (RDFDefaultErrorHandler.java:36) -
file:data/mods/part01-split16.mods.xml.rdf(line 249340 column 184): {W131}
String not in Unicode Normal Form C: "Versuch einer kurzen Geschichte der
römisch-catholischen deutschen Bibelübersetzung"
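The two strings quoted in the warnings each carry a base letter followed by a combining mark: U+0301 (UTF-8 bytes CC 81) and U+0308 (UTF-8 bytes CC 88). A short Python sketch for inspecting such strings follows. The function names `combining_marks` and `is_nfc` are hypothetical, and the check is a plain standard-library comparison, not the validator that produced the W131 warnings.

```python
import unicodedata


def combining_marks(text):
    """Return (index, Unicode name) for each combining character in text."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if unicodedata.combining(ch)
    ]


def is_nfc(text):
    """True if text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


# The flagged strings, written with explicit escapes for the combining marks.
title_fr = "Muse\u0301e bibliographique"
title_de = "ro\u0308misch-catholischen deutschen Bibelu\u0308bersetzung"

marks_fr = combining_marks(title_fr)
marks_de = combining_marks(title_de)

print(marks_fr)          # [(4, 'COMBINING ACUTE ACCENT')]
print(is_nfc(title_fr))  # False
print(is_nfc(title_de))  # False
```

Both strings are in decomposed form, which is consistent with a Form C check rejecting them.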
While I can't explain why these particular examples get the error (and I
will keep looking at it), I have some evidence that the MARC -> MARCXML
program does not output Unicode Normal Form C. This causes display
problems for some characters (although not, as far as I know, the ones
in the examples). It is possible to translate the data into Form C if
needed.
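One way the translation into Form C might be done on a MARCXML record is sketched below in Python, using only the standard library. This is an assumption-laden illustration, not the converter under discussion: the function name `normalize_tree_to_nfc` is hypothetical, and the sample record (tag 245, its indicators, and its placement of the title) is made up for the example. Only the namespace URI is the one MARCXML slim records use.

```python
import unicodedata
import xml.etree.ElementTree as ET

MARCXML_NS = "http://www.loc.gov/MARC21/slim"


def normalize_tree_to_nfc(root):
    """Normalize element text to NFC in place; return how many nodes changed."""
    changed = 0
    for el in root.iter():
        if el.text:
            nfc = unicodedata.normalize("NFC", el.text)
            if nfc != el.text:
                el.text = nfc
                changed += 1
    return changed


# Made-up sample record holding a decomposed title (e + U+0301).
sample = (
    f'<record xmlns="{MARCXML_NS}">'
    '<datafield tag="245" ind1="1" ind2="0">'
    '<subfield code="a">Muse\u0301e bibliographique</subfield>'
    "</datafield></record>"
)

root = ET.fromstring(sample)
changed = normalize_tree_to_nfc(root)
subfield = root.find(f".//{{{MARCXML_NS}}}subfield")

print(changed)                     # 1
print(len(subfield.text))          # 21
```

Walking the parsed tree, instead of normalizing the raw file as one string, keeps the change confined to element text and leaves the markup itself untouched.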
In any case, it looks like it isn't something that Alistair introduced
with his code. If I can figure out for sure that it's a MARCXML issue,
I'll suggest that code should be modified.
kc
--
-----------------------------------
Karen Coyle / Digital Library Consultant
kco...@kcoyle.net http://www.kcoyle.net
ph.: 510-540-7596 skype: kcoylenet
fx.: 510-848-3913
mo.: 510-435-8234
------------------------------------