[Bug 7133] Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8 will give garbage when decoding entities

bugzilla-daemon Fri, 13 Feb 2015 10:51:03 -0800

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7133


Mark Martinec <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |3.4.1

--- Comment #8 from Mark Martinec <[email protected]> ---
Bug 7133: Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8 will
give garbage when decoding entities
  Sending lib/Mail/SpamAssassin/HTML.pm
  Sending lib/Mail/SpamAssassin/Message/Node.pm
  Sending lib/Mail/SpamAssassin/Message.pm
Committed revision 1659641.


This implements the proposed changes:

- to avoid HTML::Parser utf8_mode bug (rt#99755) provide input to
  the HTML parser as Unicode characters when possible (i.e. when
  normalize_charset is on);

- if this is not possible (no charset decoding), then turn on the utf8_mode
  setting in HTML::Parser, which tells it to adopt byte semantics
  and not complain; if the input text is encoded in US-ASCII or UTF-8,
  the result will be mostly correct (except for the module's rt#99755 bug);
  if the input is in any other encoding, the result will be a mix of
  encodings: the original text will remain unchanged, while HTML entities
  will be encoded as UTF-8 or Latin-1. It's not perfect, but it is still an
  improvement over the present situation (e.g. it will not turn on the utf8
  flag in results).

- the sub _normalize() is capable of returning either Unicode characters
  or UTF-8 octets. When we know we'll be parsing HTML, we can request a
  character result, thus avoiding an unnecessary encode/decode pair.

- the test t/html_utf8.t is adapted as described earlier.
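
The two input paths described above can be sketched roughly as follows.
This is only an illustration under stated assumptions, not the code from
the commit: the helper parse_text() and the sample markup are hypothetical,
and HTML::Parser (a CPAN module) must be installed. It feeds the same
UTF-8 document to HTML::Parser once as decoded characters and once as raw
octets with utf8_mode turned on.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Parser ();

# Hypothetical helper: collect the entity-decoded text ("dtext") that
# HTML::Parser reports for a document.
sub parse_text {
    my ($input, %opts) = @_;
    my $out = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        text_h      => [ sub { $out .= $_[0] }, 'dtext' ],
    );
    # Byte semantics: the parser assumes its input is UTF-8 octets.
    $p->utf8_mode(1) if $opts{utf8_mode};
    $p->parse($input);
    $p->eof;
    return $out;
}

# Sample input (an assumption for illustration): UTF-8 encoded octets.
my $octets = "<p>caf\xC3\xA9 &euro;5</p>";

# Path 1: charset decoding is available, so hand characters to the parser.
my $from_chars = parse_text(decode('UTF-8', $octets));

# Path 2: no charset decoding, so keep octets and enable utf8_mode.
my $from_bytes = parse_text($octets, utf8_mode => 1);

printf "characters path: utf8 flag %s\n", utf8::is_utf8($from_chars) ? 'on' : 'off';
printf "utf8_mode path:  utf8 flag %s\n", utf8::is_utf8($from_bytes) ? 'on' : 'off';
```

In the first path the parser returns a character string, which the caller
would still need to encode before handing it on as bytes. In the second
path the result stays a byte string, but its exact content depends on the
input encoding and on the rt#99755 behaviour mentioned above.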


In summary: whatever text comes out of these decoding steps
(QP/Base64 decoding, Content-type charset, HTML decoding) will remain
as bytes (utf8 flag off) and be given to rules and plugins as such.
Ideally it will be encoded as UTF-8 octets, which is the case when the
text was originally encoded as UTF-8 or when normalize_charset is enabled.
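
As a rough illustration of what "bytes (utf8 flag off), encoded as UTF-8
octets" means in Perl terms, here is a minimal sketch using only core
modules. The helper body_as_utf8_octets() is hypothetical and is not how
SpamAssassin structures these steps; it merely mimics transfer decoding
followed by charset normalization for a Base64 body declared as ISO-8859-1.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use MIME::Base64 qw(decode_base64);

# Hypothetical helper, for illustration only.
sub body_as_utf8_octets {
    my ($base64_body, $charset) = @_;
    my $raw   = decode_base64($base64_body);   # transfer decoding: raw bytes
    my $chars = decode($charset, $raw);        # charset decoding: characters
    return encode('UTF-8', $chars);            # UTF-8 octets, utf8 flag off
}

# "caf\xE9" (Latin-1) encoded as Base64.
my $octets = body_as_utf8_octets('Y2Fm6Q==', 'ISO-8859-1');

printf "length in bytes: %d, utf8 flag: %s\n",
    length($octets), utf8::is_utf8($octets) ? 'on' : 'off';
```

Rules and plugins that receive such a string see five octets, with the
accented letter represented by the two-byte UTF-8 sequence C3 A9.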

A consequence: turning on the normalize_charset setting is advised.
It is no longer as expensive as it might otherwise be (the result is now
always in bytes), and it gives further processing a predictable encoding.

-- 
You are receiving this mail because:
You are the assignee for the bug.
