[Bug 7133] Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8 will give garbage when decoding entities

bugzilla-daemon Thu, 12 Feb 2015 08:51:41 -0800

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7133


--- Comment #6 from Mark Martinec <[email protected]> ---
(In reply to John Wilcock from comment #5)
Thanks for the feedback!

> Thinking about this purely from the rule-writing user's point of view,
> totally ignoring the history and largely ignoring the underlying technical
> details :-) I would want to be able to include any Unicode characters
> directly in the rule file, and have it match the equivalent characters in
> the message, regardless of Content-Type charset, and regardless of any
> base64, quoted-printable and/or (for HTML message parts) &entity; encoding. 
> 
> So I should be able to write things like
> 
> body CRAZY_EURO /€uro/
> header SUBJ_CREDIT_FR Subject =~ /crédit/
> 
> and match any occurrences of "€uro" or "crédit" regardless of what charset
> the message was originally encoded in and whether entities were used.

You can already do this today. For example, I have the following rules
(for localized spam/phishing):

  body L_WEBTEAM2_3  m{Ček vaš email hvala}
  body L_WEBTEAM2_4  m{Administrator je e-poštni sistem}
  body L_WEBTEAM2_5  m{Hvala za vaše sodelovanje pri zaščiti!}

These are stored exactly as you see them here, i.e. as UTF-8 in
a local.cf file, written with a text editor running in a UTF-8 locale.

> This of course would imply that rule .cf files would need to be encoded
> in UTF-8 (or whatever) and subjected to charset normalisation.

Right. The above rules (yours or mine) work in the following cases:

 - if text in a mail message is already encoded in UTF-8 (after QP and
   Base64 decoding);

 - or if normalize_charset is enabled (available even in 3.4.0, improved
   in trunk) and the original character set can be decoded successfully.
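
To illustrate the two cases above, here is a minimal Python sketch (not
SpamAssassin code). It assumes that rules are matched as UTF-8 octets and
that normalization amounts to "decode from the declared charset, re-encode
as UTF-8". The helper normalize_to_utf8 and the ISO-8859-2 sample body are
hypothetical.

```python
import re

# Rule text as it would sit in a UTF-8 encoded .cf file, seen as octets.
rule = re.compile(re.escape("zaščiti".encode("utf-8")))

# Hypothetical message body sent as ISO-8859-2 (after QP/Base64 decoding).
raw_body = "Hvala za vaše sodelovanje pri zaščiti!".encode("iso-8859-2")


def normalize_to_utf8(body: bytes, declared_charset: str) -> bytes:
    """Assumed model of charset normalization: decode, then re-encode as UTF-8."""
    return body.decode(declared_charset).encode("utf-8")


# Without normalization the octets differ, so the rule cannot match.
matches_raw = rule.search(raw_body) is not None

# After normalization the body holds the same UTF-8 octets as the rule.
matches_normalized = rule.search(normalize_to_utf8(raw_body, "iso-8859-2")) is not None

print(matches_raw, matches_normalized)
```

If the declared charset is wrong or the text cannot be decoded, the
normalization step fails, which corresponds to the "can be successfully
decoded" condition above.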

It does not currently work for HTML entities, which (owing to
HTML::Parser idiosyncrasies, the way SpamAssassin uses it, and its bugs)
can end up decoded either as Unicode (wide characters) or as Latin-1,
regardless of the original encoding of the text.
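
The mismatch can be shown with a small Python analogy. This is only a
sketch of the symptom, not of HTML::Parser's actual code path: it assumes
a rule matched as UTF-8 octets, and it uses Python's html.unescape in
place of the real entity decoder. The sample HTML fragment is made up.

```python
import html
import re

# Rule from a UTF-8 encoded .cf file, matched as octets.
rule = re.compile(re.escape("crédit".encode("utf-8")))

# Hypothetical HTML part: pure ASCII, with the accent written as an entity.
html_part = b"Votre cr&eacute;dit est accept&eacute;"

# Entity decoding yields characters; how they are then stored is the issue.
decoded_text = html.unescape(html_part.decode("ascii"))

as_latin1 = decoded_text.encode("latin-1")  # e-acute stored as one octet, E9
as_utf8 = decoded_text.encode("utf-8")      # e-acute stored as two octets, C3 A9

latin1_hit = rule.search(as_latin1) is not None
utf8_hit = rule.search(as_utf8) is not None

print(latin1_hit, utf8_hit)
```

Only the UTF-8 form matches the rule, so a decoded entity that ends up as
Latin-1 is missed even though the reader sees the same word.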


> I guess that's a
> whole new can of worms, but IMO it would make it far easier to address
> international spam patterns. After all your efforts to normalise the
> message, it would be a great shame to have to encode all non-ASCII
> characters in rules, e.g. 
> 
> body CRAZY_EURO /\x{20AC}uro/

The above won't work unless we decide to go for full decoding into Unicode.
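
A rough Python sketch of why a code-point pattern needs decoded text.
Decoding the UTF-8 octets as Latin-1 is used here only as a stand-in for
"a string of undecoded octets", where every character is below 256; that
modelling choice and the sample body are assumptions.

```python
import re

body_octets = "Only 1000 €uro".encode("utf-8")

# Model of undecoded text: one character per octet, all below U+0100.
undecoded = body_octets.decode("latin-1")

# Fully decoded text: the euro sign is the single character U+20AC.
decoded = body_octets.decode("utf-8")

code_point_rule = re.compile("\u20acuro")  # counterpart of /\x{20AC}uro/

hit_octets = code_point_rule.search(undecoded) is not None
hit_decoded = code_point_rule.search(decoded) is not None

print(hit_octets, hit_decoded)
```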

> though I would of course expect things to work if written that way. 
> It would be an even greater shame if rules had to be written as UTF-8 bytes
> 
> body CRAZY_EURO /\xE2\x82\xACuro/

Even now, /\xE2\x82\xACuro/ and /€uro/ are equivalent
(assuming the *.cf file is encoded in UTF-8, which is common).
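
The equivalence is easy to check outside SpamAssassin. The Python sketch
below assumes the rule line is stored as UTF-8 and extracts the pattern
between the slashes with a naive split, which is for illustration only
and is not how rule files are parsed.

```python
import re

# A rule line as written in the editor, then as the octets saved to disk.
cf_line = "body CRAZY_EURO /€uro/"
cf_octets = cf_line.encode("utf-8")

# Naive extraction of the pattern between the slashes (illustration only).
pattern_from_file = cf_octets.split(b"/")[1]

# The same pattern spelled out with hex escapes.
pattern_escaped = b"\xE2\x82\xACuro"

same_octets = pattern_from_file == pattern_escaped

# Both spellings match a UTF-8 encoded body in the same way.
body = "Win 1000 €uro today".encode("utf-8")
hit_literal = re.search(re.escape(pattern_from_file), body) is not None
hit_escaped = re.search(rb"\xE2\x82\xACuro", body) is not None

print(same_octets, hit_literal, hit_escaped)
```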


> Next question: what effect (if any) would this have on rawbody rules?

It shouldn't have any effect on rawbody rules, nor on plugins
which pull the raw ('pristine') text.

-- 
You are receiving this mail because:
You are the assignee for the bug.
