https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7133
--- Comment #3 from Mark Martinec <[email protected]> ---

> I'm more inclined for #2 at least for 3.4.X.

So am I for the short term - keeping text encoded as octets (hopefully
encoded as UTF-8 in most cases) for the rest of processing.

Meanwhile I found an already documented bug in HTML::Parser:

  https://rt.cpan.org/Public/Bug/Display.html?id=99755

which also affects SpamAssassin if we choose to take advantage of the
utf8_mode of parsing. It seems that some HTML entities are left as Latin-1
octets even if utf8_mode is on. The problem seems localized to individual
HTML paragraphs with the right mix of content, and I'm seeing occasional
fallout from this bug as junk bayes tokens, now that I'm calling
HTML::Parser with utf8_mode enabled.

It also seems that HTML::Parser hasn't been maintained lately, so it will
probably take some time for the bug to get fixed. I was unable to find an
easy workaround, apart from decoding the text to Unicode first (utf8 flag on),
letting HTML::Parser work in Unicode, and encoding the result to UTF-8
again after HTML parsing.

As it happens, the sub MS::Message::Node::rendered() contains both a call
to _normalize() and a call to HTML->parse a little further down. It may be
possible to avoid one unnecessary encoding/decoding pair by carrying Unicode
from _normalize() to HTML::Parser and doing the encoding to UTF-8 only
after HTML decoding. Will see what that approach would look like.
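
To make the failure mode and the workaround concrete, here is a minimal,
self-contained sketch. It is written in Python rather than Perl so that it
runs with the standard library only, and it does not call HTML::Parser:
buggy_decode_entities() is a hypothetical stand-in that merely simulates
"entities left as Latin-1 octets" in a simplified form, and
decode_entities_via_unicode() illustrates the decode, process, encode order
described above. Both function names are assumptions made for illustration.

```python
import html
import re

# Matches named and numeric character references such as &eacute; or &#233;
_ENTITY = re.compile(rb"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def buggy_decode_entities(octets: bytes) -> bytes:
    """Simulate the reported symptom (NOT HTML::Parser's actual logic):
    entities in the Latin-1 range are emitted as single Latin-1 octets
    inside an otherwise UTF-8 encoded byte string."""
    def repl(match):
        ch = html.unescape(match.group(0).decode("ascii"))
        if len(ch) == 1 and ord(ch) < 256:
            return ch.encode("latin-1")   # one raw octet, e.g. b"\xe9"
        return ch.encode("utf-8")         # everything else stays UTF-8
    return _ENTITY.sub(repl, octets)


def decode_entities_via_unicode(octets: bytes) -> bytes:
    """Workaround order: decode octets to Unicode first, decode entities
    while working on characters, and encode to UTF-8 only at the end."""
    text = octets.decode("utf-8", errors="replace")
    return html.unescape(text).encode("utf-8")


if __name__ == "__main__":
    sample = "<p>caf&eacute; \u2013 na\u00efve</p>".encode("utf-8")
    print(buggy_decode_entities(sample))        # mixed Latin-1/UTF-8 octets
    print(decode_entities_via_unicode(sample))  # consistent UTF-8 octets
```

In the simulated buggy output the lone Latin-1 octet makes the byte string
invalid as UTF-8, which is the kind of input that would end up as junk
tokens downstream. The second function never mixes encodings because all
entity decoding happens on characters, not octets.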
