https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7133
--- Comment #4 from Mark Martinec <[email protected]> ---

While experimenting with our HTML decoding, I noticed that our test t/html_utf8.t expects HTML text to be decoded into Unicode characters (not UTF-8 octets), and fails otherwise.

The t/html_utf8.t test installs the following rule:

  body QUOTE_YOUR /\x{201c}Your/

and runs SpamAssassin on the file t/data/spam/009, which contains the following HTML chunk:

  Click the “Your Account”

(these entities represent double quotes: Click the “Your Account”)

Grepping through our current rules, I don't see a single case of a Unicode character (like \x{9999} or \N{U+9999}) in any of the rules, although there are lots of single bytes encoded as \x99. So our rules do not seem to expect Unicode characters, just bytes, unlike the t/html_utf8.t test.

Note that this has nothing to do with the normalize_charset setting; it is entirely an effect of HTML::Parser being called with utf8_mode off (which is the default, and has been in effect in SpamAssassin so far).

So the question now is: is the test wrong in what it expects, i.e. should the test rule instead be:

  body QUOTE_YOUR /\xE2\x80\x9CYour/

The next logical question could be: when developing new rules (or updating existing rules), what should be the expected representation of plain or HTML text as given to rules:

- in Unicode characters (i.e. QP and Base64 decoded + character set decoded (Content-Type) + HTML decoded)

- in UTF-8 octets (i.e. QP and Base64 decoded + character set decoded (Content-Type) + HTML decoded, then encoded into UTF-8 octets); this seems to be the situation we are currently aiming at with the normalize_charset option enabled

- as octets of the original character set, mixed with Unicode or Latin-1 entities from HTML parts (i.e. QP and Base64 decoded; HTML entities decoded into Unicode or Latin-1); this is essentially our present situation with normalize_charset off

- in UTF-8 octets, mixed with Unicode or Latin-1 entities from HTML parts (i.e. QP and Base64 decoded + character set decoded (Content-Type); HTML entities decoded into Unicode or Latin-1); this is essentially our present situation with normalize_charset on

- in UTF-8 octets, mixed with UTF-8 or Latin-1 entities from HTML parts (i.e. QP and Base64 decoded + character set decoded (Content-Type); HTML entities recoded into UTF-8 (utf8_mode on) or to Latin-1 (bug))

- no decoding, original bytes (some rules seem to expect junk like QP encoded characters)
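
To make the difference between the two rule spellings concrete, here is a minimal sketch in Python (not SpamAssassin's Perl, so only an analogy). It assumes the quotes in the HTML chunk are written as &ldquo; and &rdquo;, which may not be the spelling actually used in t/data/spam/009, and the helper octets_as_chars is a hypothetical name introduced only for this illustration.

```python
import html
import re

# Hypothetical stand-in for the HTML chunk in t/data/spam/009; the exact
# entity spelling (&ldquo; / &rdquo;) is an assumption.
HTML_CHUNK = "Click the &ldquo;Your Account&rdquo;"

# Representation A: Unicode characters (entities decoded to code points).
as_chars = html.unescape(HTML_CHUNK)

# Representation B: the same text encoded into UTF-8 octets.
as_octets = as_chars.encode("utf-8")

# Analogue of the rule  body QUOTE_YOUR /\x{201c}Your/  (matches a character).
CHAR_RULE = re.compile("\u201cYour")

# Analogue of the rule  body QUOTE_YOUR /\xE2\x80\x9CYour/  (matches octets).
OCTET_RULE = re.compile(rb"\xE2\x80\x9CYour")


def octets_as_chars(data: bytes) -> str:
    """View each octet as one character in the range 0..255, roughly how a
    byte string looks to a character-oriented pattern."""
    return data.decode("latin-1")


# Each rule is tried against the representation it was written for, and the
# character rule is also tried against the octet representation.
char_rule_on_chars = CHAR_RULE.search(as_chars) is not None
octet_rule_on_octets = OCTET_RULE.search(as_octets) is not None
char_rule_on_octets = CHAR_RULE.search(octets_as_chars(as_octets)) is not None

print(char_rule_on_chars, octet_rule_on_octets, char_rule_on_octets)
```

Under these assumptions each rule matches only the representation it was written for: the character rule finds U+201C in the decoded text, the octet rule finds E2 80 9C in the UTF-8 bytes, and the character rule finds nothing once the text has been turned into octets. That is why the choice of representation has to be settled before the test rule can be called right or wrong.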
