[Bug 7133] Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8 will give garbage when decoding entities

bugzilla-daemon Thu, 12 Feb 2015 08:30:57 -0800

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7133


--- Comment #5 from John Wilcock ---
Firstly, thanks for all the good work so far on this. 

Thinking about this purely from the rule writer's point of view, totally
ignoring the history and largely ignoring the underlying technical details :-)
I would want to be able to include any Unicode character directly in the rule
file and have it match the equivalent character in the message, regardless of
the Content-Type charset and regardless of any base64, quoted-printable and/or
(for HTML message parts) &entity; encoding.

So I should be able to write things like

body CRAZY_EURO /€uro/
header SUBJ_CREDIT_FR Subject =~ /crédit/

and match any occurrences of "€uro" or "crédit" regardless of what charset the
message was originally encoded in and whether entities were used. 
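To make the expectation concrete, here is a rough Python sketch of the behaviour described above. It is only an illustration, not SpamAssassin code: `normalise_part` and the sample payloads are hypothetical, and the real normalisation in SpamAssassin may work quite differently. The idea is simply that every variant of the message is reduced to the same Unicode text before the rule is applied.

```python
import base64
import html
import quopri
import re


def normalise_part(payload: bytes, charset: str,
                   transfer_encoding: str = "", is_html: bool = False) -> str:
    """Hypothetical helper: turn one message part into Unicode text."""
    # Undo the Content-Transfer-Encoding first.
    if transfer_encoding == "base64":
        payload = base64.b64decode(payload)
    elif transfer_encoding == "quoted-printable":
        payload = quopri.decodestring(payload)
    # Then decode the bytes using the declared Content-Type charset.
    text = payload.decode(charset, errors="replace")
    # For HTML parts, resolve &entity; and &#NNNN; references.
    if is_html:
        text = html.unescape(text)
    return text


# The rule as the rule writer would like to type it.
RULE = re.compile("€uro")

# (payload, charset, transfer encoding, is_html) -- all made-up samples.
samples = [
    ("€uro".encode("utf-8"), "utf-8", "", False),
    ("€uro".encode("iso-8859-15"), "iso-8859-15", "", False),
    ("€uro".encode("cp1252"), "cp1252", "", False),
    (base64.b64encode("€uro".encode("utf-8")), "utf-8", "base64", False),
    (b"=E2=82=ACuro", "utf-8", "quoted-printable", False),
    (b"&euro;uro", "us-ascii", "", True),
    (b"&#8364;uro", "us-ascii", "", True),
]

# One boolean per sample: did the single rule match after normalisation?
results = [RULE.search(normalise_part(*sample)) is not None
           for sample in samples]
print(results)
```

The point of the sketch is that the rule is written once, with the literal character, and the charset, transfer-encoding and entity handling all happen on the message side.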

This of course would imply that rule .cf files would need to be encoded in
UTF-8 (or whatever) and subjected to the same charset normalisation as the
message. I guess that's a whole new can of worms, but IMO it would make it far
easier to address international spam patterns. After all your efforts to
normalise the message, it would be a great shame to have to escape all
non-ASCII characters in rules, e.g.

body CRAZY_EURO /\x{20AC}uro/

though I would of course expect things to work if written that way. 
It would be an even greater shame if rules had to be written as UTF-8 bytes

body CRAZY_EURO /\xE2\x82\xACuro/
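For anyone unsure how the three spellings of the rule relate, this small Python sketch shows it. It is only an analogue: Python's `re` module stands in for Perl's regex engine here, and the variable names and sample text are made up. The literal character and the code point escape are the same pattern over decoded text, while the byte form only matches text that is still raw UTF-8.

```python
import re

literal = "€uro"                 # character typed directly
escaped = "\u20acuro"            # same character written as code point U+20AC
utf8_bytes = b"\xe2\x82\xacuro"  # the UTF-8 encoding of that character

# The first two are the identical string; the third is its UTF-8 encoding.
same_string = (literal == escaped)
same_bytes = (literal.encode("utf-8") == utf8_bytes)

decoded_text = "Pay in €uro today"                 # message after normalisation
raw_utf8 = decoded_text.encode("utf-8")            # undecoded UTF-8 message
raw_latin9 = decoded_text.encode("iso-8859-15")    # same text, other charset

char_rule = re.compile(r"\u20acuro")        # pattern over characters
byte_rule = re.compile(rb"\xe2\x82\xacuro")  # pattern over raw bytes

char_hit = char_rule.search(decoded_text) is not None
byte_hit_utf8 = byte_rule.search(raw_utf8) is not None
byte_hit_latin9 = byte_rule.search(raw_latin9) is not None

print(same_string, same_bytes, char_hit, byte_hit_utf8, byte_hit_latin9)
```

The byte-level rule misses the ISO-8859-15 copy of the same text, which is the practical reason rules written as UTF-8 bytes would be a shame.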

Next question: what effect (if any) would this have on rawbody rules?

