Simon Spero
Mon, 16 Mar 2009 12:02:28 -0700
It's a lot easier to generate decomposed characters when converting from MARC-8 to Unicode. Sometimes the output even happens to be valid NFD; otherwise the combining diacritics can come out in the wrong order.
It can be easier to do some types of string matching on NFD (it's easier to
ignore diacritics), but when encoding HTML or XML, NFC is much better. LC
doesn't do NFC for its web interfaces, which is why accents often look wrong
in Firefox.
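
To make the NFD/NFC trade-off concrete, here is a minimal sketch in Python using only the standard unicodedata module. The function names are illustrative assumptions, not part of any MARC or MARCXML toolkit, and the sample string is the decomposed form of one of the titles from the warnings quoted below.

```python
import unicodedata


def to_nfc(text):
    """Compose base letters and combining marks (the form to emit in HTML/XML)."""
    return unicodedata.normalize("NFC", text)


def strip_diacritics(text):
    """Decompose to NFD, then drop combining marks for accent-insensitive matching."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Decomposed input: 'e' followed by U+0301 COMBINING ACUTE ACCENT,
# as a MARC-8 conversion might plausibly produce it (an assumption for illustration).
decomposed_title = "Muse\u0301e bibliographique"

# NFC folds the pair into the single code point U+00E9.
composed_title = to_nfc(decomposed_title)

# Matching that ignores diacritics works the same on either form.
match_key = strip_diacritics(composed_title)
```

Normalizing to NFC just before serializing, and keeping an NFD-derived key only for matching, gets the benefit of both forms without storing the data twice.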
Simon
On Mon, Mar 16, 2009 at 1:59 PM, Karen Coyle <kco...@kcoyle.net> wrote:
> Alistair had a large number of error messages about character set problems
> when he processed records from MARC through various steps into RDF:
>
> WARN [main] (RDFDefaultErrorHandler.java:36) -
> file:data/mods/part01-split16.mods.xml.rdf(line 249403 column 117): {W131}
> String not in Unicode Normal Form C: "Musée bibliographique"
>
> WARN [main] (RDFDefaultErrorHandler.java:36) -
> file:data/mods/part01-split16.mods.xml.rdf(line 249340 column 184): {W131}
> String not in Unicode Normal Form C: "Versuch einer kurzen Geschichte der
> römisch-catholischen deutschen Bibelübersetzung"
>
> While I can't explain why these particular examples get the error (and I
> will keep looking at it), I have some evidence that the MARC -> MARCXML
> program does not output Unicode Normal Form C. This causes display problems
> for some characters (although not, as far as I know, the ones in the
> examples). It is possible to translate the data into Form C if needed.
>
> In any case, it looks like it isn't something that Alistair introduced with
> his code. If I can figure out for sure that it's a MARCXML issue, I'll
> suggest that code should be modified.
>
> kc
>
> --
> -----------------------------------
> Karen Coyle / Digital Library Consultant
> kco...@kcoyle.net http://www.kcoyle.net
> ph.: 510-540-7596 skype: kcoylenet
> fx.: 510-848-3913
> mo.: 510-435-8234
> ------------------------------------
>