[Bug 7133] Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8 will give garbage when decoding entities

bugzilla-daemon Tue, 10 Feb 2015 08:50:30 -0800

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7133


--- Comment #3 from Mark Martinec <[email protected]> ---
> I'm more inclined for #2 at least for 3.4.X.

So am I for the short term - keeping text encoded as octets (hopefully
encoded as utf-8 in most cases) for the rest of processing.


Meanwhile I found an already documented bug in HTML::Parser:

  https://rt.cpan.org/Public/Bug/Display.html?id=99755

which also affects SpamAssassin if we choose to take advantage of
the utf8_mode of parsing. It seems that some HTML entities are left
as Latin-1 octets even if utf8_mode is on. The problem seems localized
to individual HTML paragraphs with the right mix of content,
and I'm seeing occasional fallout from this bug as junk bayes
tokens, now that I'm calling HTML::Parser with utf8_mode enabled.

It also seems that HTML::Parser hasn't been maintained lately,
so it will probably take some time for the bug to get fixed.

I was unable to find an easy workaround, apart from decoding text
to Unicode first (utf8 flag on), letting HTML::Parser work in
Unicode, and encoding the result to UTF-8 again after HTML parsing.
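
A minimal sketch of that decode / parse / encode round trip, assuming
the input is valid UTF-8. The sub name html_text_utf8 and the sample
input are hypothetical and for illustration only; this is not the
SpamAssassin code:

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Parser ();

# Hypothetical helper: takes UTF-8 octets containing HTML and returns
# the entity-decoded text, again as UTF-8 octets.
sub html_text_utf8 {
    my ($octets) = @_;

    # 1. Decode octets to Perl characters (utf8 flag on).
    my $chars = decode('UTF-8', $octets);

    # 2. Parse in character mode; utf8_mode is deliberately left off.
    #    'dtext' asks for text with HTML entities already decoded.
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        text_h      => [ sub { $text .= $_[0] }, 'dtext' ],
    );
    $parser->parse($chars);
    $parser->eof;

    # 3. Encode the resulting characters back to UTF-8 octets.
    return encode('UTF-8', $text);
}

# Example: literal UTF-8 mixed with named entities in one paragraph.
my $out = html_text_utf8("<p>caf\xC3\xA9 &eacute;&euro;</p>");
```

The cost is one extra decode/encode pair per HTML part, which is what
makes this less attractive than a fixed utf8_mode.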

As it happens, the sub MS::Message::Node::rendered() contains both
a call to _normalize() and a call to HTML->parse a little further
down. It may be possible to avoid one unnecessary encoding/decoding
pair by carrying Unicode from _normalize() to HTML::Parser and doing
the encoding to UTF-8 only after HTML decoding. I will see what
that approach would look like.

-- 
You are receiving this mail because:
You are the assignee for the bug.
