dc-rda  

MARC and Unicode normalization forms

Karen Coyle
Mon, 16 Mar 2009 11:00:25 -0700

Alistair had a large number of error messages about character set problems when he processed records from MARC through various steps into RDF:

WARN [main] (RDFDefaultErrorHandler.java:36) -
file:data/mods/part01-split16.mods.xml.rdf(line 249403 column 117): {W131}
String not in Unicode Normal Form C: "Musée bibliographique"

WARN [main] (RDFDefaultErrorHandler.java:36) -
file:data/mods/part01-split16.mods.xml.rdf(line 249340 column 184): {W131}
String not in Unicode Normal Form C: "Versuch einer kurzen Geschichte der
römisch-catholischen deutschen Bibelübersetzung"

While I can't explain why these particular examples trigger the warning (I will keep looking into it), I have some evidence that the MARC -> MARCXML conversion program does not output Unicode Normal Form C. This causes display problems for some characters (although not, as far as I know, the ones in the examples). The data can be normalized to Form C if needed.
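
For anyone who wants to check or convert strings themselves, here is a minimal sketch in Python using the standard library's unicodedata module. It is not the code Alistair or the MARCXML converter uses. The sample string is an assumption for illustration: I have written "Musée" with a decomposed é (plain "e" followed by the combining acute accent U+0301), which is one way a string can fail the Form C check. I have not confirmed that this is what is actually in the records. The function names is_nfc and to_nfc are made up for this example.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is already in Unicode Normal Form C."""
    # Normalizing an NFC string is a no-op, so comparing works as a check.
    return unicodedata.normalize("NFC", text) == text


def to_nfc(text: str) -> str:
    """Return text converted to Unicode Normal Form C."""
    return unicodedata.normalize("NFC", text)


# Hypothetical input: "e" + U+0301 (combining acute) instead of U+00E9.
decomposed = "Muse\u0301e bibliographique"
composed = to_nfc(decomposed)

print(is_nfc(decomposed))  # False
print(is_nfc(composed))    # True
```

Both strings look the same on screen when the display handles combining characters properly, but the decomposed one is one code point longer, which is the kind of difference a Form C check picks up.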

In any case, it doesn't look like something that Alistair introduced with his code. If I can establish for sure that it's a MARCXML issue, I'll suggest that the code be modified.

kc

--
-----------------------------------
Karen Coyle / Digital Library Consultant
kco...@kcoyle.net http://www.kcoyle.net
ph.: 510-540-7596   skype: kcoylenet
fx.: 510-848-3913
mo.: 510-435-8234
------------------------------------