On Thu, Jul 01, 2004 at 11:22:42AM -0400, Houghton,Andrew wrote:
> I'm not sure what MARC::Charset does internally, but MARC-8
> defines the diacritic separate from the base character. So
> even using binmode(STDOUT,":utf8") will produce two characters,
> one for the base character followed by the diacritic. If you
> want them combined then you need to combine them.
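
To make "combine them" concrete, here is a minimal sketch of the idea. It is written in Python purely for illustration and is not the Perl scripts discussed in this thread; the variable names are made up for the example. It assumes the converted data contains a base letter followed by a Unicode combining mark, and uses Unicode normalization form C to compose the pair (in Perl 5.8 the core Unicode::Normalize module provides NFC() and NFD() for the same job).

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: base character first,
# then the diacritic, i.e. the decomposed shape described above.
decomposed = "e\u0301"

# NFC composes base + combining mark into a single code point
# whenever a precomposed character exists.
composed = unicodedata.normalize("NFC", decomposed)

# NFD goes the other way and splits the precomposed character again.
roundtrip = unicodedata.normalize("NFD", composed)

print(len(decomposed), len(composed))   # 2 1
print(composed == "\u00e9")             # True
print(roundtrip == decomposed)          # True
```

Both strings display as the same accented letter; only the number of code points differs, which is why the output looks like two characters when counted.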

As you suggest, Andy, MARC::Charset simply translates MARC-8 combining
characters into UTF-8 combining characters.

> It just so happens that I have recently been converting MARC-XML
> to RDF. The RDF specification mandates Unicode Normal form C,
> which means that the base character and the diacritic are
> combined. MARC-XML uses Unicode Normal form D, which means that
> the base character is separate from the diacritic. So I hacked
> together some Perl scripts to convert Unicode NFD <-> Unicode NFC.
> The scripts require Perl 5.8.0.

Wow, I've always been under the impression that character sets
operated the same way in RDF as they do in XML proper, with the
'encoding' attribute:

    <?xml version="1.0" encoding="UTF-8" ?>

> I was talking with a colleague, just yesterday, about whether we
> should unleash these on the Net... They need to be cleaned up a
> little and need some basic documentation on how to run the Perl
> scripts.

It would be nice to have them wrapped up with a module interface for
use in non-command-line apps. I would be open to integrating this
functionality into MARC::Charset if you are interested.

//Ed