https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7133
--- Comment #6 from Mark Martinec <[email protected]> ---
(In reply to John Wilcock from comment #5)

Thanks for the feedback!

> Thinking about this purely from the rule-writing user's point of view,
> totally ignoring the history and largely ignoring the underlying technical
> details :-) I would want to be able to include any Unicode characters
> directly in the rule file, and have it match the equivalent characters in
> the message, regardless of Content-Type charset, and regardless of any
> base64, quoted-printable and/or (for HTML message parts) &entity; encoding.
>
> So I should be able to write things like
>
>   body CRAZY_EURO /€uro/
>   header SUBJ_CREDIT_FR Subject =~ /crédit/
>
> and match any occurrences of "€uro" or "crédit" regardless of what charset
> the message was originally encoded in and whether entities were used.

You can do it even now. For example I have the following rules
(for localized spam/phishing):

  body L_WEBTEAM2_3  m{Ček vaš email hvala}
  body L_WEBTEAM2_4  m{Administrator je e-poštni sistem}
  body L_WEBTEAM2_5  m{Hvala za vaše sodelovanje pri zaščiti!}

These are encoded as you see them here, i.e. UTF-8 encoded in a local.cf
file, using a text editor (in UTF-8 locale).

> This of course would imply that rule .cf files would need to be encoded
> in UTF-8 (or whatever) and subjected to charset normalisation.

Right. The above rules (yours or mine) work in the following cases:

- if text in a mail message is already encoded in UTF-8
  (after QP and Base64 decoding);

- or if normalize_charset is enabled (even in 3.4.0, improved in trunk)
  and the original character set can be successfully decoded;

It does not currently work for HTML entities which (based on HTML::Parser
idiosyncrasy, its use in SpamAssassin, and its bugs) can end up as
Unicode (wide character) or as Latin-1 - regardless of the original
encoding of the text.

> I guess that's a
> whole new can of worms, but IMO it would make it far easier to address
> international spam patterns. After all your efforts to normalise the
> message, it would be a great shame to have to encode all non-ASCII
> characters in rules, e.g.
>
>   body CRAZY_EURO /\x{20AC}uro/

The above won't work, unless we decide to go for full decoding
into Unicode.

> though I would of course expect things to work if written that way.
> It would be an even greater shame if rules had to be written as UTF-8 bytes
>
>   body CRAZY_EURO /\xE2\x82\xACuro/

The /\xE2\x82\xACuro/ and /€uro/ are even now equivalent (assuming
the encoding of a *.cf file is in UTF-8, which is common).

> Next question: what effect (if any) would this have on rawbody rules?

It shouldn't have any effect on rawbody rules, and neither on plugins
which pull the raw ('pristine') text.
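
To make the byte-level reasoning above concrete, here is a minimal sketch in Python (not SpamAssassin's own Perl code, so it only approximates the behaviour described). The sample message text, the choice of ISO-8859-15 as the "other" charset, and the helper `normalize` are all assumptions made up for illustration; `normalize` is a hypothetical stand-in for the effect normalize_charset is described as having (decode from the declared charset, hand the rules UTF-8).

```python
import re

# A rule as it sits in a UTF-8 encoded .cf file: the pattern is just bytes.
rule_literal = "€uro".encode("utf-8")   # what the editor saves for /€uro/
rule_escaped = b"\xE2\x82\xACuro"       # the hand-written /\xE2\x82\xACuro/

# Both spellings are the same byte sequence, hence equivalent rules.
same_bytes = rule_literal == rule_escaped

pattern = re.compile(re.escape(rule_literal))

text = "Win 1000 €uro now"                 # made-up sample text
body_utf8 = text.encode("utf-8")           # message already in UTF-8
body_latin9 = text.encode("iso-8859-15")   # euro sign is the single byte 0xA4


def normalize(raw: bytes, charset: str) -> bytes:
    """Hypothetical stand-in for charset normalisation: decode, re-encode as UTF-8."""
    return raw.decode(charset).encode("utf-8")


# Case 1: text already in UTF-8 -> the byte pattern matches.
match_utf8 = pattern.search(body_utf8) is not None

# Without normalisation, the ISO-8859-15 bytes do not contain the UTF-8 sequence.
match_latin9_raw = pattern.search(body_latin9) is not None

# Case 2: after normalisation the same message matches.
match_latin9_norm = pattern.search(normalize(body_latin9, "iso-8859-15")) is not None

# A code-point pattern (the analogue of /\x{20AC}uro/) only matches text that
# has been fully decoded into Unicode characters. Viewing the UTF-8 octets as
# one character per byte (approximated here by a Latin-1 decode) does not match.
codepoint_pattern = re.compile("\u20acuro")
match_codepoint_on_octets = codepoint_pattern.search(body_utf8.decode("latin-1")) is not None
match_codepoint_on_text = codepoint_pattern.search(body_utf8.decode("utf-8")) is not None

if __name__ == "__main__":
    print("literal == escaped bytes:", same_bytes)
    print("UTF-8 body matches:", match_utf8)
    print("ISO-8859-15 body, raw:", match_latin9_raw)
    print("ISO-8859-15 body, normalised:", match_latin9_norm)
    print("code-point pattern on octets:", match_codepoint_on_octets)
    print("code-point pattern on decoded text:", match_codepoint_on_text)
```

The sketch covers only the plain-text cases; it says nothing about HTML entities, where the decoded form depends on HTML::Parser as noted above.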
