On January 11, 2003 at 19:02, Tomohiro KUBOTA wrote:

> I have a question on CharsetConverters.  I am planning to
> use UTF-8 filter like following.
>
> <CharsetConverters override>
> plain;      mhonarc::htmlize;
> us-ascii;   mhonarc::htmlize;
> default;    MHonArc::UTF8::str2sgml;    MHonArc/UTF8.pm
> </CharsetConverters>
...
> This, I'd like to assume that raw 8bit characters are all KOI8-R
> and convert these 8bit characters into either
>  - SGML entity expressions,
>  - &#xxx; expressions where xxx mean decimal Unicode codepoints, or
>  - UTF-8 characters.
> How can I configure MHonArc to achieve this?

It looks like you will need to try out the latest development version
to achieve what you want.  The latest development version is now
frozen for new functionality and is being evaluated for any major
problems before release (and I'm looking for as many people as
possible who are willing to test things out before the release).

You can grab a copy of the development version from
<http://www.mhonarc.org/release/MHonArc/tar/>.  Just grab one of the
-snap bundles.

With the latest code, much more character encoding support has been
added, including Russian sets like KOI8-R.

One way to get what you want with the latest snapshot build is with
the following resource settings:

  <!-- Want everything to go to UTF-8 -->
  <CharsetConverters override>
  us-ascii;   mhonarc::htmlize;
  default;    MHonArc::UTF8::str2sgml;    MHonArc/UTF8.pm
  </CharsetConverters>

  <!-- Make sure to register UTF-8-aware clipping function -->
  <TextClipFunc>
  MHonArc::UTF8::clip; MHonArc/UTF8.pm
  </TextClipFunc>

  <!-- Alias the special "plain" set to koi8-r to deal with
       improper mail headers -->
  <CharsetAliases>
  koi8-r; plain
  </CharsetAliases>

  <!-- If no charset specified, assume koi8-r as the default
       instead of us-ascii -->
  <DefCharset>
  koi8-r
  </DefCharset>

  <!-- ...
       HERE define *PGBEGIN resource to denote utf-8 document
       character set with <meta http-equiv="content-type"> tag.
       See utf-8.mrc example resource file in distribution.
       ...
    -->

Another way would be:

  <!-- TEXTENCODE allows all character data to be mapped to a
       given character encoding when messages are first read.
    -->
  <TextEncode>
  utf-8; MHonArc::UTF8::to_utf8; MHonArc/UTF8.pm
  </TextEncode>

  <!-- With data translated to UTF-8, CHARSETCONVERTERS becomes
       simpler -->
  <CharsetConverters override>
  default; mhonarc::htmlize
  </CharsetConverters>

  <!-- Need to also register UTF-8-aware text clipping function -->
  <TextClipFunc>
  MHonArc::UTF8::clip; MHonArc/UTF8.pm
  </TextClipFunc>

  <!-- Alias the special "plain" set to koi8-r to deal with
       improper mail headers -->
  <CharsetAliases>
  koi8-r; plain
  </CharsetAliases>

  <!-- If no charset specified, assume koi8-r as the default
       instead of us-ascii -->
  <DefCharset>
  koi8-r
  </DefCharset>

  <!-- ...
       HERE define *PGBEGIN resource to denote utf-8 document
       character set with <meta http-equiv="content-type"> tag.
       See utf-8-encode.mrc example resource file in distribution.
       ...
    -->

Using the TEXTENCODE method is probably more efficient overall.

Test the above first to make sure things work as you want.  If you
have any problems, you should follow up to the [EMAIL PROTECTED]
mailing list since the above is not yet provided in an official
release.

The snapshot builds do contain updated documentation (excluding the
nodoc bundles, where the docs are not present).  You can also check
out the latest docs at
<http://www.mhonarc.org/release/MHonArc/snapshot/doc/>.

Check out the CHARSETCONVERTERS and TEXTENCODE resource pages for
more details about these resources.  The pages can currently be seen
via the Web at:

  <http://www.mhonarc.org/release/MHonArc/snapshot/doc/resources/charsetconverters.html>
  <http://www.mhonarc.org/release/MHonArc/snapshot/doc/resources/textencode.html>

You may want to start with the TEXTENCODE page since it provides
information on the differences and relationships between TEXTENCODE
and CHARSETCONVERTERS.

Side Note: You will notice that the docs mention Unicode::MapUTF8,
and MHonArc may use it depending on your Perl installation.  However,
I noticed conversion problems with Unicode::MapUTF8 when dealing with
Japanese character data, i.e.
it did nothing, but it did not complain.  It may be that I did not
install the Jcode module correctly, or that Unicode::MapUTF8 is
failing to recognize it.

Therefore, if you are using a version of Perl earlier than 5.8 and
have Unicode::MapUTF8 installed, run tests with Japanese messages.
If you get problems, either use 5.8 (since the Encode module is
available) or uninstall Unicode::MapUTF8 and let MHonArc use the
fallback implementation for conversion to UTF-8.

I am considering dropping support for Unicode::MapUTF8 since the
Encode module supersedes it and is standard with Perl 5.8.  Also, it
appears that Unicode::MapUTF8 is no longer being actively maintained.

--ewh

---------------------------------------------------------------------
To sign-off this list, send email to [EMAIL PROTECTED] with the
message text UNSUBSCRIBE MHONARC-USERS
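
Appended illustration: to make the output forms from the question
concrete (decimal &#xxx; references versus real UTF-8 characters),
here is a minimal Python sketch.  It is not MHonArc code and says
nothing about how MHonArc::UTF8 is implemented; the function names
and the sample text are made up for this example.  The only
assumption carried over from the thread is that raw 8-bit bytes are
KOI8-R.

```python
import html

# Hypothetical sample: the word "Привет" as raw KOI8-R bytes, standing in
# for a message body that arrived without a charset label.
KOI8R_SAMPLE = b"\xf0\xd2\xc9\xd7\xc5\xd4"


def decode_koi8r(raw: bytes) -> str:
    """Interpret raw 8-bit data as KOI8-R and return Unicode text."""
    return raw.decode("koi8_r")


def to_numeric_refs(text: str) -> str:
    """Escape HTML specials, then write non-ASCII characters as decimal
    &#NNNN; references (the second form listed in the question)."""
    escaped = html.escape(text, quote=False)
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in escaped)


def to_utf8_bytes(text: str) -> bytes:
    """Escape HTML specials, then emit real UTF-8 bytes (the third form)."""
    return html.escape(text, quote=False).encode("utf-8")


if __name__ == "__main__":
    text = decode_koi8r(KOI8R_SAMPLE)
    print(to_numeric_refs(text))
    print(to_utf8_bytes(text))
```

The numeric-reference output is pure ASCII and can be served under any
page charset, while the UTF-8 output is only correct if the page
declares utf-8, which is why both resource setups above call for a
<meta http-equiv="content-type"> tag in the *PGBEGIN resource.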
