This mail is an automated notification from the bugs tracker of the project: MHonArc.
/**************************************************************************/
[bugs #11187] Latest Modifications:

Changes by: Earl Hood <[EMAIL PROTECTED]>
Date: Fri 12/03/2004 at 20:41 (US/Central)

            What    | Removed                   | Added
---------------------------------------------------------------------------
         Resolution | None                      | Fixed
      Fixed Release |                           | CVS

------------------ Additional Follow-up Comments ----------------------------
Fix checked into CVS.

/**************************************************************************/
[bugs #11187] Full Item Snapshot:

URL: <http://savannah.nongnu.org/bugs/?func=detailitem&item_id=11187>
Project: MHonArc
Submitted by: Egmont Koblinger
On: Thu 12/02/2004 at 00:04

Category: Character Sets
Severity: 5 - Average
Item Group: Incorrect Behavior
Resolution: Fixed
Privacy: Public
Assigned to: None
Status: Open
Platform Version: Linux
Perl Version: 5.8.5
Component Version: 2.6.10
Fixed Release: CVS

Summary: incorrectly parsing UTF-8 encoded messages

Original Submission:

I use mhonarc without any configuration file, simply with the command "mhonarc -outdir outdir indir", where "indir" contains only one file with a single message encoded in UTF-8. (Both the subject and the body contain UTF-8 encoded accented letters; the subject uses quoted-printable, the body's transfer encoding is 8-bit.)

The output html files are quite strange. For each UTF-8 byte sequence only the first byte is taken into account, and it is converted to an html escape. For example, the Euro sign (U+20AC, UTF-8: E2 82 AC) appears in the html output as "&#E2;"; the bytes 82 and AC are then skipped, and processing goes on with the next Unicode character.

In MHonarc/CharEnt.pm line 153 there's a switch to check whether perl is new enough to support UTF-8. If it isn't, then manual processing of UTF-8 characters takes place.
Forcing the "non-UTF-8-aware perl" branch of the "if" statement (that is, changing the "if ($] >= 5.006)" to "if (0)") repairs the problem; in this case the output is the expected "&#x20AC;".

I don't think it matters, but I have LANG=hu_HU (latin2 locale) and no other LC_* variables set. However, UTF-8 locales are also available on my system.

Follow-up Comments
------------------

-------------------------------------------------------
Date: Fri 12/03/2004 at 20:41       By: Earl Hood <ehood>
Fix checked into CVS.

-------------------------------------------------------
Date: Fri 12/03/2004 at 20:11       By: 0 <None>
Yes, your patch is definitely nicer than mine. I told you I'm a beginner in perl :-) Thanks for the fix!

-------------------------------------------------------
Date: Fri 12/03/2004 at 18:42       By: Earl Hood <ehood>
The sample patch provided is not applicable for 5.6.x since the Encode module is only available for 5.8.x and later.

After some searching, it appears that adding the "U0" specifier to unpack makes things work. I do not fully understand why unpack requires this, but it appears to fix the problem.

I've attached a patch to this report. It will be checked into CVS after I can resolve some connectivity problems with the CVS server.

-------------------------------------------------------
Date: Thu 12/02/2004 at 19:28       By: 0 <None>
Sample patch follows that fixes the problem. It's just a case study to show what the problem is; depending on the Encode module may not be nice, and I have no idea whether it's supported in older perls. (Note that I'm an absolute beginner in perl.)

The problem is that when unpack is executed in line 159 (according to the original 2.6.10 source), its parameter ($1) is just a sequence of bytes and perl has no idea that it should be interpreted as utf8. Hence I guess it interprets it according to latin1, and that's why unpack doesn't do what we need.
Before using unpack we have to tell perl "hey, that's a utf8 string".

-------------------------------------------------------
Date: Thu 12/02/2004 at 00:58       By: Egmont Koblinger <egmont>
I attach a test case. This doesn't happen only for one particular message but for every message I write with mutt using UTF-8 encoding, so it's not a problem to generate a publicly visible test case.

Both the subject and the body contain the following string: "asdf", then "e acute" (both latin1 and latin2), then "e grave" (only latin1), then "o doubleacute" (only latin2), then a euro sign (neither latin1 nor latin2), followed by "jkl;".

The input directory contains the message. The output-actual directory was generated with mainstream mhonarc 2.6.10 using "mhonarc -outdir output-actual input". Similarly, output-expected was generated with mhonarc patched as described above. All of this is packed into a single tarball.

-------------------------------------------------------
Date: Thu 12/02/2004 at 00:36       By: Earl Hood <ehood>
Can the submitter please zip up a sample message and send it to the author's address for evaluation? Or you can attach the bundle to this bug report if it is okay that the email message is readable by the public.

Please also provide samples of the correct and incorrect conversion of the message.
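The symptom discussed in the comments above can be reproduced outside of MHonArc. The following is only a minimal sketch in Python, not the Perl code from CharEnt.pm and not either of the attached patches. The function names are hypothetical, the "&#x...;" hex reference format is an assumption, and HTML special characters are deliberately not handled. It contrasts converting only the lead byte of each UTF-8 sequence with decoding the whole sequence to a code point first.

```python
def utf8_to_entities(data: bytes) -> str:
    """Decode UTF-8 bytes, then write each non-ASCII character as a
    hexadecimal character reference (assumed format: &#xHHHH;)."""
    text = data.decode("utf-8")
    return "".join(ch if ord(ch) < 0x80 else "&#x%X;" % ord(ch) for ch in text)


def first_byte_only(data: bytes) -> str:
    """Mimic the reported symptom (hypothetical reconstruction): the lead
    byte of each multi-byte sequence is converted on its own and the
    continuation bytes (0x80..0xBF) are skipped."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))           # plain ASCII passes through
        elif b >= 0xC0:
            out.append("&#x%X;" % b)     # lead byte treated as a character
        # continuation bytes are silently dropped
    return "".join(out)


if __name__ == "__main__":
    euro = b"\xe2\x82\xac"               # UTF-8 encoding of U+20AC
    print(first_byte_only(euro))         # &#xE2;
    print(utf8_to_entities(euro))        # &#x20AC;
```

The second function is the behaviour one would expect once the byte string is known to be UTF-8, which is the point made in the 12/02/2004 19:28 comment.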
File Attachments
-------------------

-------------------------------------------------------
Date: Fri 12/03/2004 at 18:42  Name: mhonarc-utf8-CharEnt.patch  Size: 346B  By: ehood
UTF-8 to entity ref patch that works for Perl 5.6.x and 5.8.x
http://savannah.nongnu.org/bugs/download.php?item_id=11187&item_file_id=1938

-------------------------------------------------------
Date: Thu 12/02/2004 at 19:28  Name: mhonarc-utf8.patch  Size: 516B  By: None
sample fix
http://savannah.nongnu.org/bugs/download.php?item_id=11187&item_file_id=1936

-------------------------------------------------------
Date: Thu 12/02/2004 at 00:58  Name: mhonarc-utf8.tar.gz  Size: 2.65KB  By: egmont
example
http://savannah.nongnu.org/bugs/download.php?item_id=11187&item_file_id=1933

For detailed info, follow this link:
<http://savannah.nongnu.org/bugs/?func=detailitem&item_id=11187>

_______________________________________________
Message sent via/by Savannah
http://savannah.nongnu.org/