Hi,

On 1/31/08, Galen Charlton <[EMAIL PROTECTED]> wrote:
> As it happens, at this very moment I am working on some patches to
> improve character set conversion, including adding support for
> converting Latin-1 MARC records to UTF-8 from the command-line import
> jobs. I should have something for you to test later today or
> tomorrow.

This patch (against the current 3.0 tip) is now available for review at
http://manage-gmc.dev.kohalibrary.com/patches/charset

It introduces a new module, C4::Charset, to centralize the code required
for MARC character conversion in Koha. From the commit message:

"IMPORTANT - refactor MARC character set handling

Created a new module, C4::Charset, to centralize code for converting
MARC records to UTF8. This module has three exported functions:

* IsStringUTF8ish - determine if scalar contains a string in UTF8
* MarcToUTF8Record - convert MARC blob or MARC::Record to UTF8
* SetMarcUnicodeFlag - set appropriate MARC21 or UNIMARC field to
  indicate that record is in UTF-8.

Design points of this module include:

* No dependencies on other C4 modules, making it easier to add
  more test cases
* All character conversion code in one place
* Single entry point for doing a character conversion on a MARC record
* Capture of errors and warnings produced by Text::Iconv and
  MARC::Charset
* Start of support for guessing the source character set of a
  MARC record.

Several functions were moved from other scripts or modules to
C4::Charset:

* C4::Koha->FixEncoding (expanded and renamed MarcToUTF8Record)
* C4::Koha->char_decode5426
* fMARC8ToUTF8 from bulkmarcimport.pl (renamed _marc_marc8_to_utf8)

Several batch jobs were adjusted to use MarcToUTF8Record instead of
FixEncoding."

One effect of this patch: when the source character set of a MARC
record is not known (as is currently the case with bulkmarcimport),
MarcToUTF8Record will first try converting the record from MARC-8 and,
if that results in errors, will then try Latin-1. I also intend to add
an option to bulkmarcimport to specify the source encoding explicitly.

I will add more test cases as I debug the module, so if any of you run
into problems with character conversion, please file bugs or send me
samples of the MARC records in question.
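
For illustration only, here is a rough Python sketch of the
try-one-encoding-then-fall-back idea described above. It is not the
Perl code in C4::Charset. The names is_utf8ish and
convert_with_fallback are hypothetical, and because Python's standard
library has no MARC-8 codec, strict ASCII stands in for the MARC-8
attempt.

```python
from typing import List, Optional, Sequence, Tuple


def is_utf8ish(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8.

    Hypothetical analogue of the check described for IsStringUTF8ish.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def convert_with_fallback(
    data: bytes,
    candidates: Sequence[str] = ("ascii", "latin-1"),
) -> Tuple[Optional[str], Optional[str], List[str]]:
    """Try each candidate source encoding in order.

    Returns (text, encoding_used, errors). Errors from failed attempts
    are captured rather than raised, so the caller can log them.
    "ascii" is only a stand-in for MARC-8 here (assumption).
    """
    errors: List[str] = []
    for encoding in candidates:
        try:
            return data.decode(encoding, errors="strict"), encoding, errors
        except UnicodeDecodeError as exc:
            errors.append(f"{encoding}: {exc}")
    # No candidate decoded the data without error.
    return None, None, errors


if __name__ == "__main__":
    raw = b"caf\xe9"  # "café" encoded as Latin-1
    text, used, problems = convert_with_fallback(raw)
    print(used, text.encode("utf-8"), problems)
```

The point of returning the collected errors alongside the result is
that a batch job can report which attempts failed instead of silently
guessing.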

Regards,

Galen
--
Galen Charlton
Koha Application Developer
LibLime
[EMAIL PROTECTED]
p: 1-888-564-2457 x709
