https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7133

Mark Martinec <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |3.4.1

--- Comment #8 from Mark Martinec <[email protected]> ---
  Bug 7133: Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded
  UTF-8 will give garbage when decoding entities

Sending        lib/Mail/SpamAssassin/HTML.pm
Sending        lib/Mail/SpamAssassin/Message/Node.pm
Sending        lib/Mail/SpamAssassin/Message.pm
Committed revision 1659641.

This implements proposed changes:

- to avoid HTML::Parser utf8_mode bug (rt#99755) provide input
  to the HTML parser as Unicode characters when possible
  (i.e. when normalize_charset is on);

- if this is not possible (no charset decoding), then turn on
  utf8_mode setting in HTML::Parser, which will tell it to adopt
  byte semantics and not complain; if input text is encoded in
  US-ASCII or UTF-8 the result will be mostly correct (except for
  the module's 99755 bug); if input is other than UTF-8 the result
  will be a mix of encodings: original text will remain unchanged,
  HTML entities will be encoded as UTF-8 or Latin-1.
  It's not perfect, but is still an improvement over the present
  situation (e.g. it will not turn on utf8 flag in results).

- the sub _normalize() is capable of returning either Unicode
  characters or UTF-8 octets. When we know we'll be parsing HTML
  we can require characters result, thus avoiding unnecessary
  encoding and decoding pair.

- the test t/html_utf8.t is adapted as described earlier.

In summary: whatever text comes out of these decoding steps
(QP/Base64 decoding, Content-type charset, HTML decoding)
will remain as bytes (utf8 flag off) and be given to rules
and plugins as such. It will hopefully be encoded as UTF-8
octets (when such is the original encoding of a text, or
when normalize_charset is enabled).

A consequence: it is advised to turn on the normalize_charset
setting. It is no longer as expensive as it could be (the result
will be in bytes always), and gives predictable encoding to
further processing.

-- 
You are receiving this mail because:
You are the assignee for the bug.
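
The "mix of encodings" case from the second bullet of the comment can be illustrated with a small, language-neutral sketch. This is not SpamAssassin or HTML::Parser code: the helper names below are hypothetical, Python's html.unescape stands in for the entity decoder, and entities are always emitted as UTF-8 here (the comment notes that Latin-1 is also possible). It only shows why decoding entities under byte semantics on non-UTF-8 input yields a mixed result, while decoding the charset first gives clean UTF-8 octets.

```python
import html
import re

# Matches named and numeric entities such as &euro; or &#8364; in a byte string.
ENTITY = re.compile(rb"&#?[0-9A-Za-z]+;")


def decode_entities_bytewise(raw: bytes) -> bytes:
    """Byte semantics: replace each entity with its UTF-8 octets and
    leave every other byte of the input untouched."""
    def repl(match):
        entity = match.group(0).decode("ascii")
        return html.unescape(entity).encode("utf-8")
    return ENTITY.sub(repl, raw)


def decode_entities_charwise(raw: bytes, charset: str) -> bytes:
    """Character semantics: decode the declared charset first, decode
    entities on characters, then encode the result as UTF-8 octets."""
    text = raw.decode(charset)
    return html.unescape(text).encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    latin1_body = "caf\u00e9 &euro;5".encode("latin-1")
    mixed = decode_entities_bytewise(latin1_body)
    clean = decode_entities_charwise(latin1_body, "latin-1")
    print(is_valid_utf8(mixed), is_valid_utf8(clean))
```

In the byte-wise path the Latin-1 byte for "é" survives next to the UTF-8 octets of the euro sign, so the result is not valid in either encoding. The character-wise path corresponds to the situation the comment recommends, where the charset is decoded before HTML parsing.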
