dc-rda  

Re: MARC and Unicode normalization forms

Rebecca S Guenther
Tue, 17 Mar 2009 12:53:18 -0700

I ran this by a colleague here who has done a lot of these transformations, and 
he said the following:

From Morgan Cundiff:
She says "the MARC -> MARCXML program does not output Unicode Normal Form 
C". My first question would be "what program is that?" There are quite a few 
that do this.

Whatever it is, she is probably right. I used MARC Report. I then used the Perl 
script provided by OCLC to convert the MARC slim file from Normalization Form D 
(decomposed) to Normalization Form C (composed).

My understanding is that there is no Form C equivalent for a small number of 
the decomposed combinations used in MARC records, so those stay decomposed.
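
For illustration only, the Form D to Form C step can be sketched in Python with
the standard library's unicodedata module. This is not the OCLC Perl script; the
helper name to_nfc and the sample strings are assumptions made for this example.
The second sample is a base letter plus combining mark for which Unicode defines
no precomposed character, so it should stay decomposed after conversion.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", text)


# "e" followed by U+0301 COMBINING ACUTE ACCENT, as in a decomposed record.
decomposed = "Muse\u0301e bibliographique"
composed = to_nfc(decomposed)

# The base letter and the accent collapse into the single code point U+00E9,
# so the composed string is one code point shorter.
print(len(decomposed), len(composed))

# Illustrative assumption: "t" + U+0361 COMBINING DOUBLE INVERTED BREVE (a
# romanization ligature). Unicode has no precomposed character for this pair,
# so NFC leaves it as separate code points.
ligature = "t\u0361s"
print(to_nfc(ligature) == ligature)
```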

Morgan



Rebecca S. Guenther                                                       
 Senior Networking and Standards Specialist                  
 Network Development and MARC Standards Office     
 Library of Congress   
 101 Independence Ave. SE                                       
 Washington, DC 20540-4402                                          
 (202) 707-5092 (voice)    (202) 707-0115 (FAX)           
 r...@loc.gov

>>> DC-RDA automatic digest system <lists...@jiscmail.ac.uk> 3/16/2009 8:05 PM >>>

Date:    Mon, 16 Mar 2009 10:59:24 -0700
From:    Karen Coyle <kco...@kcoyle.net>
Subject: MARC and Unicode normalization forms

Alistair had a large number of error messages about character set
problems when he processed records from MARC through various steps into RDF:

WARN [main] (RDFDefaultErrorHandler.java:36) -
file:data/mods/part01-split16.mods.xml.rdf(line 249403 column 117): {W131}
String not in Unicode Normal Form C: "Musée bibliographique"

WARN [main] (RDFDefaultErrorHandler.java:36) -
file:data/mods/part01-split16.mods.xml.rdf(line 249340 column 184): {W131}
String not in Unicode Normal Form C: "Versuch einer kurzen Geschichte der
römisch-catholischen deutschen Bibelübersetzung"

While I can't explain why these particular examples get the error (and I
will keep looking at it), I have some evidence that the MARC -> MARCXML
program does not output Unicode Normal Form C. This causes display
problems for some characters (although not, as far as I know, the ones
in the examples). It is possible to translate the data into Form C if
needed.
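
As a rough sketch of what that translation could look like (an assumption for
illustration, not the code used in this pipeline), the Python below flags
strings that are not in Form C and then normalizes them. The function name
non_nfc_strings and the third sample title are hypothetical; the first two
titles are taken from the warnings above, written with explicit combining marks.

```python
import unicodedata


def non_nfc_strings(values):
    """Return the values that are not already in Normalization Form C."""
    return [v for v in values if unicodedata.normalize("NFC", v) != v]


titles = [
    "Muse\u0301e bibliographique",  # e + combining acute
    "ro\u0308misch-catholischen deutschen Bibelu\u0308bersetzung",  # o, u + diaeresis
    "Plain ASCII title",  # hypothetical control value, already in Form C
]

# Find the strings a Form C check would warn about, then compose them.
flagged = non_nfc_strings(titles)
repaired = [unicodedata.normalize("NFC", t) for t in flagged]

for before, after in zip(flagged, repaired):
    print(f"{before!r} -> {after!r}")
```

Comparing each string with its own NFC form is used here because it works on
any Python 3 version; it reports the same condition as the W131 warning without
depending on the RDF parser.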

In any case, it looks like it isn't something that Alistair introduced
with his code. If I can figure out for sure that it's a MARCXML issue,
I'll suggest that the code be modified.

kc

-- 
-----------------------------------
Karen Coyle / Digital Library Consultant
kco...@kcoyle.net http://www.kcoyle.net 
ph.: 510-540-7596   skype: kcoylenet
fx.: 510-848-3913
mo.: 510-435-8234
------------------------------------