dc-rda  

Re: MARC and Unicode normalization forms

Xavier Agenjo
Tue, 24 Mar 2009 11:18:27 -0700

What about ISO 25577, I mean MarcXchange?

Xavier Agenjo Bullón
Project Director
Fundación Ignacio Larramendi
Claudio Coello, 123, 4º
28006 Madrid
Tel.: (34) 915 81 25 37
Fax: (34) 915 81 47 36
xavier.age...@larramendi.es
www.larramendi.es

-----Original Message-----
From: List for discussion on Resource Description and Access (RDA)
[mailto:dc-...@jiscmail.ac.uk] On Behalf Of Rebecca S Guenther
Sent: Tuesday, 17 March 2009 20:34
To: DC-RDA@JISCMAIL.AC.UK
Subject: Re: MARC and Unicode normalization forms

I ran this by a colleague here who has done a lot of these transformations,
and he said the following:

From Morgan Cundiff:
She says that "the MARC -> MARCXML program does not output Unicode Normal
Form C". My first question would be "what program is that?". There are quite
a few that do this.

Whatever it is, she is probably right. I used Marc Report. I then used the
Perl script provided by OCLC to convert the MARC slim file from
Normalization Form D (decomposed) to Normalization Form C (composed).
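
The OCLC Perl script itself is not reproduced in this thread. As a rough
sketch of the same NFD -> NFC step, assuming the record text has already been
decoded to a Unicode string, Python's standard unicodedata module can do the
conversion. The helper name to_nfc and the sample title are illustrative only.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", text)


# 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed, as in Form D)
decomposed = "Muse\u0301e bibliographique"
composed = to_nfc(decomposed)

# The composed string is one code point shorter: e + U+0301 became U+00E9.
print(len(decomposed), len(composed))
print(composed == "Mus\u00e9e bibliographique")
```

Normalizing a string that is already in Form C leaves it unchanged, so a step
like this should be safe to run over mixed data.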

My understanding is that there is no Form C equivalent for a small number of
the decomposed combinations used in marc records. So those stay decomposed.
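
To illustrate that point: Form C only composes a base letter and a combining
mark when Unicode defines a precomposed character for the pair. The sample
strings below are assumptions chosen for illustration, not taken from the
records discussed here, and the helper name classify is hypothetical.

```python
import unicodedata


def classify(text: str) -> str:
    """Report whether NFC normalization shortens the string."""
    nfc = unicodedata.normalize("NFC", text)
    return "composed" if len(nfc) < len(text) else "stays decomposed"


samples = {
    "u + combining diaeresis": "u\u0308",        # has a precomposed form, U+00FC
    "q + combining diaeresis": "q\u0308",        # no precomposed q-diaeresis exists
    "t + double inverted breve + s": "t\u0361s", # ligature-style mark spanning two letters
}

for label, text in samples.items():
    print(f"{label}: {classify(text)}")
```

A string that stays decomposed in this way is still valid Form C, because NFC
is defined as "compose wherever a precomposed character exists".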

Morgan



Rebecca S. Guenther
 Senior Networking and Standards Specialist
 Network Development and MARC Standards Office
 Library of Congress
 101 Independence Ave. SE
 Washington, DC 20540-4402
 (202) 707-5092 (voice)    (202) 707-0115 (FAX)
 r...@loc.gov

>>> DC-RDA automatic digest system <lists...@jiscmail.ac.uk> 3/16/2009 8:05 PM >>>

Date:    Mon, 16 Mar 2009 10:59:24 -0700
From:    Karen Coyle <kco...@kcoyle.net>
Subject: MARC and Unicode normalization forms

Alistair had a large number of error messages about character set
problems when he processed records from MARC through various steps into RDF:

WARN [main] (RDFDefaultErrorHandler.java:36) -
file:data/mods/part01-split16.mods.xml.rdf(line 249403 column 117): {W131}
String not in Unicode Normal Form C: "Musée bibliographique"

WARN [main] (RDFDefaultErrorHandler.java:36) -
file:data/mods/part01-split16.mods.xml.rdf(line 249340 column 184): {W131}
String not in Unicode Normal Form C: "Versuch einer kurzen Geschichte der
römisch-catholischen deutschen Bibelübersetzung"
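
Both flagged strings carry their accents as a base letter followed by a
combining mark (U+0301 and U+0308). A minimal sketch of an equivalent check,
assuming the titles are available as Python strings, is shown here. The helper
names is_nfc and combining_marks are hypothetical, and this is not the check
the RDF parser itself runs.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """True if the string is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


def combining_marks(text: str) -> list:
    """List the combining characters found in the string."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"
        for ch in text
        if unicodedata.combining(ch)
    ]


titles = [
    "Muse\u0301e bibliographique",   # decomposed, as in the first warning
    "ro\u0308misch-catholischen",    # decomposed, as in the second warning
    "Mus\u00e9e bibliographique",    # precomposed, for comparison
]

for title in titles:
    if not is_nfc(title):
        print("not in NFC:", combining_marks(title))
    else:
        print("in NFC")
```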

While I can't explain why these particular examples get the error (and I
will keep looking at it), I have some evidence that the MARC -> MARCXML
program does not output Unicode Normal Form C. This causes display
problems for some characters (although not, as far as I know, the ones in
the examples). It is possible to translate the data into Form C if
needed.

In any case, it looks like it isn't something that Alistair introduced
with his code. If I can figure out for sure that it's a MARCXML issue,
I'll suggest that code should be modified.

kc

-- 
-----------------------------------
Karen Coyle / Digital Library Consultant kco...@kcoyle.net
http://www.kcoyle.net 
ph.: 510-540-7596   skype: kcoylenet
fx.: 510-848-3913
mo.: 510-435-8234
------------------------------------