If what follows seems boring and you use MARC::Charset with any
regularity, just upgrade MARC::Charset to v0.97. If you are interested
in knowing why, read on...

Thanks for the details [1], Michael. You've uncovered a rather nasty
bug in MARC::Charset >= v0.8. MARC::Charset::Compiler processes LC's
MARC8/Unicode mapping file [2], but was not handling the fact that two
character mappings (out of 16398) lacked a recommended <ucs> mapping
and relied on an <alt> instead. For example:

 <code>
   <isCombining>true</isCombining>
   <marc>EC</marc>
   <ucs/>
   <utf-8/>
   <alt>FE21</alt>
   <altutf-8>EFB8A1</altutf-8>
   <name>LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF</name>
   <note>...</note>
 </code>
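
To make the fallback concrete, here is a minimal sketch of the "use <alt>
when <ucs> is empty" rule applied to the entry above. It is written in
Python purely for illustration and is not the MARC::Charset::Compiler
code; the function name unicode_for is hypothetical, and the <note>
element is left out for brevity.

```python
import xml.etree.ElementTree as ET

# The entry quoted above, minus its <note> element.
ENTRY = """
<code>
  <isCombining>true</isCombining>
  <marc>EC</marc>
  <ucs/>
  <utf-8/>
  <alt>FE21</alt>
  <altutf-8>EFB8A1</altutf-8>
  <name>LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF</name>
</code>
"""


def unicode_for(code):
    """Return the Unicode character for a <code> element, or None.

    Hypothetical helper: prefer <ucs>, fall back to <alt> when <ucs>
    is missing or empty.
    """
    ucs = (code.findtext("ucs") or "").strip()
    alt = (code.findtext("alt") or "").strip()
    hex_value = ucs or alt
    if not hex_value:
        return None  # no mapping at all: the caller must handle this
    return chr(int(hex_value, 16))


code = ET.fromstring(ENTRY)
char = unicode_for(code)
print(f"MARC-8 {code.findtext('marc')} -> U+{ord(char):04X}")
# prints: MARC-8 EC -> U+FE21
```

Encoding the result as UTF-8 gives the bytes EF B8 A1, which matches the
<altutf-8> value in the entry.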

The result of this is that nulls were getting sprinkled in
marc8_to_utf8 results if your data happened to contain either:

 - DOUBLE TILDE, SECOND HALF / COMBINING DOUBLE TILDE RIGHT HALF
 - LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF

The good news is that MARC::Charset v0.97 has been released to CPAN
with a fix to use <alt> when <ucs> is not available. The bad news is
that if you've used MARC::Charset to convert to UTF-8, your converted
data may contain these nulls too. I'm sorry :-(
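
If you want to check data you have already converted, a rough sketch of a
scan for NUL bytes follows. It is in Python for illustration only, the
function name null_offsets is hypothetical, and it assumes a NUL byte
never occurs legitimately in your converted records. The sample bytes are
made up and not taken from a real record.

```python
def null_offsets(data):
    """Return the byte offsets of every NUL (0x00) byte in data."""
    return [i for i, byte in enumerate(data) if byte == 0]


# Made-up sample: two stray NULs in otherwise ordinary UTF-8 text.
sample = "ab\x00cd\x00".encode("utf-8")
print(null_offsets(sample))
# prints: [2, 5]
```

To scan a file, read it in binary mode and pass the bytes to the function;
an empty list means no NULs were found.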

If you use MARC::Charset **PLEASE** upgrade to v0.97 immediately.
Thanks to Michael O'Connor for noticing the bug, and to Mike Rylander
for the fix.

Also going out in this release is a fix from Mike Rylander to allow \r
and \n to pass unchanged through marc8_to_utf8, since \r and \n are
reported to pop up occasionally in UNIMARC data.

//Ed

[1] http://www.nntp.perl.org/group/perl.perl4lib/2007/05/msg2507.html
[2] http://www.loc.gov/marc/specifications/codetables.xml
