https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7133
--- Comment #1 from Mark Martinec <[email protected]> ---

Now for possible solutions. Carefully reading (and understanding) HTML::Parser's warning "Parsing of undecoded UTF-8 will give garbage when decoding entities" gives us a clue. There are exactly two correct options when processing HTML mail parts:

1) Decode the text (based on its declared encoding) into perl characters (Unicode) and pass that to HTML::Parser for HTML parsing. (A slightly modified MS::Message::_normalize() could do that.) The result would remain as perl characters (utf8 flag on), with HTML entities properly decoded. This is the default mode of operation of HTML::Parser, and it is how SpamAssassin calls it, except that SpamAssassin violates the parser's assumption about perl characters by giving it encoded text (as UTF-8 or ISO-8859-1 or whatever the original encoding is). No wonder HTML::Parser complains when it notices that it has been given UTF-8 encoded octets instead of perl characters.

2) The other option is to stick to the current processing as bytes in most parts of SpamAssassin. HTML::Parser can also deal with text given as UTF-8 encoded octets, but in that case ->utf8_mode(1) needs to be set to let it know this is the case. The result remains encoded as UTF-8 octets, with HTML entities properly represented as UTF-8 octets. This was the solution originally proposed by Sebastian Jaenicke in Bug 4046. This alternative is the least disruptive to the rest of SpamAssassin, as it retains text as octets (no utf8 flag). The only drawback shows up when the original text is encoded in something other than UTF-8 (as in 32% of messages, according to Bug 7126). Fortunately there is a clean solution to that: turn on the normalize_charset option (see Bug 6945). It would transcode any given encoding into UTF-8, so the rest of the processing (HTML decoding, rules, bayes) can rely on a single encoding and still keep the speed of processing octets (not perl characters).
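To make the two options concrete, here is a minimal standalone sketch. It is not SpamAssassin code: the helper html_text() and the sample strings are hypothetical, and only the HTML::Parser calls (new, utf8_mode, parse, eof, the 'dtext' argspec) and Encode's decode/encode are real APIs. The last step only illustrates the kind of transcoding that normalize_charset is described as doing above, not how it is implemented.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Parser ();

# Hypothetical helper: return the entity-decoded text ('dtext') of an
# HTML fragment, optionally telling the parser its input is UTF-8 octets.
sub html_text {
    my ($html, $utf8_mode) = @_;
    my $out = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        text_h      => [ sub { $out .= $_[0] }, 'dtext' ],
    );
    $p->utf8_mode(1) if $utf8_mode;
    $p->parse($html);
    $p->eof;
    return $out;
}

# Sample body as UTF-8 octets: U+010D as two raw octets, plus an entity.
my $octets = "<p>\xC4\x8D &eacute;</p>";

# Option 1: decode to perl characters first, parse in the default mode.
# The result is a character string: U+010D, space, U+00E9.
my $chars = html_text(decode('UTF-8', $octets), 0);

# Option 2: keep UTF-8 octets and set utf8_mode(1).
# The result is UTF-8 octets; the entity comes out as \xC3\xA9.
my $bytes = html_text($octets, 1);

# Text in another charset (here ISO-8859-2, where 0xE8 is U+010D) would
# first have to be transcoded to UTF-8 before option 2 can be used.
my $latin2     = "<p>\xE8 &eacute;</p>";
my $transcoded = encode('UTF-8', decode('ISO-8859-2', $latin2));
```

Both options carry the same text; they differ only in whether downstream code sees perl characters or UTF-8 octets, so decode('UTF-8', $bytes) compares equal to $chars.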
It seems to me that option #1 is the cleanest, with good potential for the future, while option #2 is the least disruptive to existing code and its speed. Keeping the status quo (with no ->utf8_mode(1) and with suppressed warnings) seems the worst.
