[Bug 7133] Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8 will give garbage when decoding entities

bugzilla-daemon Thu, 05 Feb 2015 19:25:19 -0800

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7133


--- Comment #1 from Mark Martinec <[email protected]> ---
Now for possible solutions. A careful reading (and understanding) of
the HTML::Parser warning:
  "Parsing of undecoded UTF-8 will give garbage when decoding entities"
gives us a clue.
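To make the "garbage" concrete, here is a minimal sketch in plain perl
(Encode only, no HTML::Parser involved). It is a simplified
illustration of mixing undecoded UTF-8 octets with a decoded entity,
not a reproduction of HTML::Parser internals; the variable names are
made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "e acute" as it arrives in a UTF-8 encoded mail part:
# two octets, no utf8 flag.
my $octets = "\xC3\xA9";

# What decoding the entity "&#8364;" (EURO SIGN) yields:
# one perl character above 255.
my $euro = chr(0x20AC);

# Joining the two makes perl upgrade the octets as if they were
# Latin-1, so the two octets become the two characters U+00C3, U+00A9.
my $garbage = $octets . $euro;

# Decoding the octets first keeps "e acute" as the single
# character U+00E9.
my $clean = decode('UTF-8', $octets) . $euro;

printf "garbage: %d chars, clean: %d chars\n",
    length($garbage), length($clean);
# prints "garbage: 3 chars, clean: 2 chars"
```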

There are exactly two correct options when processing HTML mail parts:


1) decode text (based on its declared encoding) into perl characters
(Unicode) and pass that to HTML::Parser for HTML parsing. (A slightly
modified MS::Message::_normalize() could do that.) The result would
remain as perl characters (utf8 flag), with HTML entities properly
decoded.

This is the default mode of operation of HTML::Parser, and this is how
SpamAssassin calls it. However, SpamAssassin violates the parser's
assumption about perl characters by giving it encoded text (as UTF-8
or ISO-8859-1 or whatever the original encoding is). No wonder that
HTML::Parser complains when it notices that it has been given UTF-8
encoded octets (instead of perl characters).
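A minimal sketch of option 1, assuming the declared charset is already
known. The helper name html_text_from_octets() is hypothetical and is
not SpamAssassin code; it only shows the order of operations (decode
first, then parse in the default mode).

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Parser ();

# Hypothetical helper: decode an HTML body into perl characters,
# then collect its text with entities decoded.
sub html_text_from_octets {
    my ($octets, $charset) = @_;

    # octets in the declared encoding -> perl characters (Unicode)
    my $chars = decode($charset, $octets);

    my $text = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        # 'dtext' asks for text with entities already decoded
        text_h => [ sub { $text .= $_[0] }, 'dtext' ],
    );
    $p->parse($chars);
    $p->eof;
    return $text;    # perl characters, utf8 flag on
}

# UTF-8 encoded input: "caf" + e acute, plus a numeric entity for EURO
my $text = html_text_from_octets("<p>caf\xC3\xA9 &#8364;5</p>", 'UTF-8');
```

The result stays as perl characters, so any later output step would
need to encode it again before writing octets.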


2) stick to the current processing as bytes in most parts of
SpamAssassin. HTML::Parser can also deal with text given to it
encoded as UTF-8 octets, but in this case ->utf8_mode(1) needs to be
set to let it know this is the case. The result remains encoded as
UTF-8 octets, with HTML entities properly represented as UTF-8
octets. This was the solution as originally proposed by
Sebastian Jaenicke in Bug 4046.

This alternative is the least disruptive to the rest of SpamAssassin,
as it retains text as octets (no utf8 flag). The only drawback arises
when the original text is encoded in something other than UTF-8
(as in 32% of messages, according to Bug 7126). Fortunately there
is a clean solution to that: turn on the normalize_charset option
(see Bug 6945). It would transcode any given encoding into UTF-8,
so the rest of the processing (HTML decoding, rules, bayes) can rely
on a single encoding and still keep the speed of processing octets
(not perl characters).
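A minimal sketch of option 2 under the same assumptions. The helper
to_utf8_octets() is a hypothetical stand-in for the transcoding that
normalize_charset is described as doing above (it is not the actual
SpamAssassin implementation), and html_text_as_utf8_octets() is
likewise made up for the example. Only ->utf8_mode(1) and the rest of
the HTML::Parser calls are the module's real API.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Parser ();

# Hypothetical stand-in for charset normalization: transcode octets
# from the declared charset into UTF-8 octets.
sub to_utf8_octets {
    my ($octets, $charset) = @_;
    return encode('UTF-8', decode($charset, $octets));
}

# Hypothetical helper: parse UTF-8 octets and return text as
# UTF-8 octets.
sub html_text_as_utf8_octets {
    my ($utf8_octets) = @_;
    my $text = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        text_h => [ sub { $text .= $_[0] }, 'dtext' ],
    );
    # Tell the parser its input is raw UTF-8, so that decoded
    # entities are expanded as UTF-8 octets too.
    $p->utf8_mode(1);
    $p->parse($utf8_octets);
    $p->eof;
    return $text;
}

# ISO-8859-1 input: e acute is the single octet 0xE9
my $latin1 = "<p>caf\xE9 &#8364;5</p>";
my $out = html_text_as_utf8_octets(to_utf8_octets($latin1, 'ISO-8859-1'));
```

Here the text and the decoded entity both end up as UTF-8 octets, so
downstream code that works on bytes sees one consistent encoding.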



It seems to me that option #1 is the cleanest, with good potential
for the future, while option #2 is the least disruptive to existing
code and its speed. Keeping the status quo (with no ->utf8_mode(1)
and with suppressed warnings) seems the worst.

-- 
You are receiving this mail because:
You are the assignee for the bug.
