[Bug 7133] Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8 will give garbage when decoding entities

bugzilla-daemon Thu, 12 Feb 2015 07:38:47 -0800

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7133


--- Comment #4 from Mark Martinec <[email protected]> ---
While experimenting with our HTML decoding, I noticed that our
test t/html_utf8.t expects HTML text to be decoded into Unicode
characters (not UTF-8 octets); otherwise it fails.

The test t/html_utf8.t installs the following rule:

  body QUOTE_YOUR /\x{201c}Your/

and runs SpamAssassin on the file t/data/spam/009, which contains
the following HTML chunk:

  Click the &#8220;Your Account&#8221

(these entities represent double quotes: Click the “Your Account”)


Grepping through our current rules, I don't see a single case of a
Unicode character (like \x{9999} or \N{U+9999}) in any of the rules,
although there are lots of single bytes encoded as \x99. So our rules
do not seem to expect Unicode characters, just bytes, unlike
the t/html_utf8.t test.

Note that this has nothing to do with the normalize_charset setting; it
is entirely an effect of HTML::Parser being called with utf8_mode off
(which is the default, and what SpamAssassin has used so far).
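
The effect can be imitated outside Perl. The Python sketch below is only
an analogy, not HTML::Parser itself: it assumes that undecoded UTF-8
octets end up being treated as Latin-1 characters once they share a
string with a decoded entity above U+00FF. The sample fragment is
hypothetical.

```python
import html

# Hypothetical HTML fragment, delivered as undecoded UTF-8 octets.
raw = "caf\u00e9 &#8220;Your Account&#8221;".encode("utf-8")

# Analogy for utf8_mode off: every octet is taken as one Latin-1 character,
# then entities are decoded into Unicode characters in the same string.
mixed = html.unescape(raw.decode("latin-1"))

# Analogy for decoding the character set first, then the entities.
clean = html.unescape(raw.decode("utf-8"))

print(repr(mixed))  # the two octets of the accented letter become two characters
print(repr(clean))  # the accented letter stays a single character
```

In the first result the quotes are correct but the non-ASCII text around
them is garbage, which is the symptom named in the bug title.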


So the question now is: is the test wrong in what it expects, i.e.
should the test rule be:  body QUOTE_YOUR /\xE2\x80\x9CYour/  instead?
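
To make the two candidate rules concrete, here is a small Python sketch
(an illustration only; SpamAssassin evaluates rules as Perl regular
expressions). It uses the chunk quoted above with the closing entity
terminated, and shows that each pattern matches only its own
representation of the text.

```python
import html
import re

chunk = "Click the &#8220;Your Account&#8221;"

chars = html.unescape(chunk)      # text as Unicode characters
octets = chars.encode("utf-8")    # the same text as UTF-8 octets

char_rule = re.compile("\u201cYour")           # analogue of /\x{201c}Your/
octet_rule = re.compile(b"\xe2\x80\x9cYour")   # analogue of /\xE2\x80\x9CYour/

print(bool(char_rule.search(chars)))    # character rule on characters
print(bool(octet_rule.search(octets)))  # octet rule on octets

# Character rule applied to the octets (viewed one octet per character):
# the three octets E2 80 9C never form the single character U+201C.
print(bool(char_rule.search(octets.decode("latin-1"))))
```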


The next logical question could be: in developing new rules (or updating
existing rules), what should be the expected representation of plain
or HTML text as given to rules:

 - in Unicode characters
   (i.e. QP and Base64 decoded + character set decoded (Content-Type)
         + HTML decoded)

 - in UTF-8 octets
   (i.e. QP and Base64 decoded + character set decoded (Content-Type)
         + HTML decoded, then encoded into UTF-8 octets;
    this seems to be the situation we are currently aiming at
    with the normalize_charset option enabled)

 - as octets of the original character set, mixed with Unicode or Latin-1
   entities from HTML parts
   (i.e. QP and Base64 decoded; HTML entities decoded into Unicode
    or Latin-1; this is essentially our present situation with
    normalize_charset off)

 - in UTF-8 octets, mixed with Unicode or Latin-1 entities from HTML parts
   (i.e. QP and Base64 decoded + character set decoded (Content-Type);
    HTML entities decoded into Unicode or Latin-1; this is essentially
    our present situation with normalize_charset on)

 - in UTF-8 octets, mixed with UTF-8 or Latin-1 entities from html parts
   (i.e. QP and Base64 decoded + character set decoded (Content-Type);
    html entities recoded into UTF-8 (utf8_mode on) or to Latin-1 (bug))

 - no decoding, original bytes
   (some rules seem to expect junk like QP encoded characters)
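
The unmixed options above can be lined up on one sample. This is a
Python sketch under stated assumptions, not SpamAssassin code: the body
part, its charset (iso-8859-2) and its quoted-printable encoding are all
hypothetical. The mixed options depend on Perl string semantics and are
not reproduced here.

```python
import html
import quopri

# Hypothetical body part: charset=iso-8859-2, quoted-printable encoded.
# =E8 is a c-with-caron in ISO-8859-2; &#8220; is a left double quote.
wire = b"=E8aj &#8220;Your"
charset = "iso-8859-2"

original_octets = quopri.decodestring(wire)    # QP decoded only
text = original_octets.decode(charset)         # + character set decoded
unicode_chars = html.unescape(text)            # + HTML decoded
utf8_octets = unicode_chars.encode("utf-8")    # then encoded into UTF-8

print(repr(wire))             # no decoding, original bytes
print(repr(original_octets))  # octets of the original character set
print(repr(unicode_chars))    # Unicode characters
print(repr(utf8_octets))      # UTF-8 octets
```

A rule written for one of these representations will generally not match
the others once non-ASCII text is involved.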

