If what follows seems boring and you use MARC::Charset with any regularity, just upgrade MARC::Charset to v0.97. If you are interested in knowing why, read on...
Thanks for the details [1], Michael. You've uncovered a rather nasty bug in MARC::Charset >= v0.8. MARC::Charset::Compiler processes LC's MARC8/Unicode mapping file [2], but it was not handling the fact that two character mappings (out of 16398) lack a recommended <ucs> mapping and rely on an <alt> instead. For example:

<code>
  <isCombining>true</isCombining>
  <marc>EC</marc>
  <ucs/>
  <utf-8/>
  <alt>FE21</alt>
  <altutf-8>EFB8A1</altutf-8>
  <name>LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF</name>
  <note>...</note>
</code>

The result is that nulls were getting sprinkled into marc8_to_utf8 results if your data happened to contain either of:

- DOUBLE TILDE, SECOND HALF / COMBINING DOUBLE TILDE RIGHT HALF
- LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF

The good news is that MARC::Charset v0.97 has been released to CPAN with a fix to use <alt> when <ucs> is not available. The bad news is that if you've used an affected version of MARC::Charset to convert to UTF-8, your converted data may contain nulls too. I'm sorry :-(

If you use MARC::Charset **PLEASE** upgrade to v0.97 immediately. Thanks to Michael O'Connor for noticing the bug, and to Mike Rylander for the fix.

Also going out in this release is a fix from Mike Rylander to allow \r and \n to pass unchanged through marc8_to_utf8, since \r and \n are reported to pop up occasionally in UNIMARC data.

//Ed

[1] http://www.nntp.perl.org/group/perl.perl4lib/2007/05/msg2507.html
[2] http://www.loc.gov/marc/specifications/codetables.xml
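
P.S. For anyone curious what "use <alt> when <ucs> is not available" amounts to, here is a minimal sketch of the fallback in Python. It is not the actual MARC::Charset::Compiler code (which is Perl); `pick_code_point` is a hypothetical helper made up for illustration, and the entry is the one quoted above with the elided <note> left out.

```python
import xml.etree.ElementTree as ET

# Entry copied from the example in this post (the elided <note> is omitted).
ENTRY = """
<code>
  <isCombining>true</isCombining>
  <marc>EC</marc>
  <ucs/>
  <utf-8/>
  <alt>FE21</alt>
  <altutf-8>EFB8A1</altutf-8>
  <name>LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF</name>
</code>
"""


def pick_code_point(entry):
    """Hypothetical helper: prefer <ucs>, fall back to <alt> when <ucs> is empty."""
    for tag in ("ucs", "alt"):
        text = (entry.findtext(tag) or "").strip()
        if text:
            return int(text, 16)
    return None  # neither mapping is present; the caller decides what to do


entry = ET.fromstring(ENTRY)
code_point = pick_code_point(entry)

# Encoding the chosen code point should agree with the entry's <altutf-8> bytes.
utf8_hex = chr(code_point).encode("utf-8").hex().upper()
print(f"{code_point:04X} {utf8_hex}")
```

The point of the design is only that an empty <ucs/> is treated as "missing" rather than as a value, so the lookup moves on to <alt> instead of emitting a bogus character.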