Karen Coyle
Tue, 17 Mar 2009 16:40:37 -0700
Rebecca,

Thank you so much for looking into this. As I understand the Unicode normal forms, it's not that one of them is more "correct" than the others; it's a matter of circumstance and your particular needs. It does look like it would be good for program developers to document which form their program outputs, in an effort to "save the time of the user."
kc

Rebecca S Guenther wrote:
I ran this by a colleague here who has done a lot of these transformations, and he said the following:

From Morgan Cundiff:

She says that "the MARC -> MARCXML program does not output Unicode Normal Form C". My first question would be "what program is that?". There are quite a few that do this. Whatever it is, she is probably right. I used Marc Report. I then used the perl script provided by OCLC to convert the marc slim file from Normalization Form D (decomposed) to Normalization Form C (composed). My understanding is that there is no Form C equivalent for a small number of the decomposed combinations used in marc records. So those stay decomposed.

Morgan

Rebecca S. Guenther
Senior Networking and Standards Specialist
Network Development and MARC Standards Office
Library of Congress
101 Independence Ave. SE
Washington, DC 20540
Washington, DC 20540-4402
(202) 707-5092 (voice)
(202) 707-0115 (FAX)
r...@loc.gov

DC-RDA automatic digest system <lists...@jiscmail.ac.uk> 3/16/2009 8:05 PM >>>

Date: Mon, 16 Mar 2009 10:59:24 -0700
From: Karen Coyle <kco...@kcoyle.net>
Subject: MARC and Unicode normalization forms

Alistair had a large number of error messages about character set problems when he processed records from MARC through various steps into RDF:

WARN [main] (RDFDefaultErrorHandler.java:36) - file:data/mods/part01-split16.mods.xml.rdf(line 249403 column 117): {W131} String not in Unicode Normal Form C: "Musée bibliographique"

WARN [main] (RDFDefaultErrorHandler.java:36) - file:data/mods/part01-split16.mods.xml.rdf(line 249340 column 184): {W131} String not in Unicode Normal Form C: "Versuch einer kurzen Geschichte der römisch-catholischen deutschen Bibelübersetzung"

While I can't explain why these particular examples get the error (and I will keep looking at it), I have some evidence that the MARC -> MARCXML program does not output Unicode Normal Form C.
This causes display problems for some characters (although not, as far as I know, the ones in the examples). It is possible to translate the data into Form C if needed.

In any case, it looks like it isn't something that Alistair introduced with his code. If I can figure out for sure that it's a MARCXML issue, I'll suggest that code should be modified.

kc

--
-----------------------------------
Karen Coyle / Digital Library Consultant
kco...@kcoyle.net http://www.kcoyle.net
ph.: 510-540-7596 skype: kcoylenet
fx.: 510-848-3913
mo.: 510-435-8234
------------------------------------
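For anyone who wants to check their own data, the Form D to Form C conversion discussed in this thread can be sketched roughly as follows. This is only an illustration using Python's standard unicodedata module, not the OCLC perl script Morgan mentions; the helper names is_nfc and to_nfc are made up for the example, and the sample strings are the decomposed forms from the warnings above.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def to_nfc(text: str) -> str:
    """Return text converted to Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", text)


# Decomposed input: "e" followed by U+0301 COMBINING ACUTE ACCENT,
# which is what the quoted-printable bytes CC 81 in the warning decode to.
decomposed = "Muse\u0301e bibliographique"
composed = to_nfc(decomposed)

print(is_nfc(decomposed))  # False: this is the kind of string W131 flags
print(is_nfc(composed))    # True
print(len(decomposed), len(composed))  # the composed form is one code point shorter

# Some base + combining sequences have no precomposed code point, so they
# remain two code points even in Form C (assumed example: q + combining tilde).
no_precomposed = "q\u0303"
print(to_nfc(no_precomposed) == no_precomposed)  # True
```

The last case is consistent with Morgan's remark that a few decomposed combinations have no Form C equivalent: such a string is already valid Form C even though it still contains a combining mark.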