https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7144

            Bug ID: 7144
           Summary: To normalize_charset or not to normalize_charset, that
                    is the question.
           Product: Spamassassin
           Version: 3.4.0
          Hardware: All
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Rules
          Assignee: [email protected]
          Reporter: [email protected]

To normalize_charset or not to normalize_charset, that is the question.

Isolating this decision from other problem reports (Bug 7126,
Bug 7133), so that those may be closed, as they do not by themselves
introduce incompatibilities between 3.4.0 and 3.4.1.

The documentation on normalize_charset now states:

normalize_charset ( 0 | 1)        (default: 0)
  Whether to decode non-UTF-8 and non-ASCII textual parts and recode
  them to UTF-8 before the text is given over to rules processing.
  The character set used for attempted decoding is primarily based on
  a declared character set in a Content-Type header, but if the
  decoding attempt fails a module Encode::Detect::Detector is
  consulted (if available) to provide a guess based on the actual
  text, and decoding is re-attempted. Even if the option is enabled
  no unnecessary decoding and re-encoding work is done when possible
  (like with an all-ASCII text with a US-ASCII or extended ASCII
  character set declaration, e.g. UTF-8 or ISO-8859-nn or Windows-nnnn).

  Unicode support in old versions of perl or in a core module Encode
  is likely to be buggy in places, so if the normalize_charset
  function is enabled it is advised to stick to more recent versions
  of perl (preferably 5.12 or later). The module
  Encode::Detect::Detector is optional; when necessary it will be
  used if it is available.
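
For illustration, here is a minimal sketch of the decode-then-fallback
logic described above (the variable names and structure are assumptions
made for this example, not the actual SpamAssassin code):

  use Encode qw(decode FB_CROAK);
  my ($declared, $octets) = ('UTF-8', "cr\xE9dit");  # declared charset is wrong
  my $chars = eval { decode($declared, $octets, FB_CROAK) };
  if (!defined $chars && eval { require Encode::Detect::Detector }) {
    # decoding by the declared charset failed, ask the detector for a guess
    my $guess = Encode::Detect::Detector::detect($octets);
    $chars = eval { decode($guess, $octets, FB_CROAK) } if defined $guess;
  }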


The final result is unchanged from what we had in 3.4.0 and earlier,
i.e. when normalize_charset is enabled the resulting decoded text
(as passed to rules and plugins) is transcoded if necessary from
the original character set encoding to UTF-8 octets.
When normalize_charset is disabled the original encoding is retained,
which may or may not be UTF-8 octets.

(There was just one exception to the above claim, where some
unfortunate HTML text could end up as Unicode perl characters
(utf8 flag on) and could slow down rules. This is now fixed in
Bug 7133; no Unicode perl characters should reach rules any
longer, regardless of the original encoding or HTML.)
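
To make the above concrete, a small sketch (the values are assumed for
this example) of what a body rule's regexp is matched against for the
word "crédit" arriving in an ISO-8859-1 text part:

  use Encode qw(decode encode);
  my $orig = "cr\xE9dit";                                   # original ISO-8859-1 octets
  my $norm = encode('UTF-8', decode('ISO-8859-1', $orig));  # "cr\xC3\xA9dit"
  # normalize_charset 0: rules are matched against $orig (original encoding)
  # normalize_charset 1: rules are matched against $norm (UTF-8 octets)
  # in neither case are perl characters (utf8 flag on) passed to rules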

Now then, it seems we have a couple of rules (mostly in 25_replace.cf
and 20_drugs.cf) which assume they are given text in Latin-1 or
Windows-1252 encoding. Such rules are only effective when
normalize_charset is off. If these rules are to remain effective
when given text encoded as UTF-8, they need to be modified.

Some of these modifications are trivial (just replacing a Latin-1
character in a regexp with its UTF-8 byte sequence). Unfortunately,
when Latin-1 characters are used in a perl regexp character class
(brackets), there is no automatic fix: single bytes would need to be
replaced by multiple bytes (byte pairs for Latin characters), so a
character-class approach no longer works and an alternation in a
regexp is needed. The following problem description is copied from
Bug 7133#c7:

>> body CRAZY_EURO /€uro/
>> header SUBJ_CREDIT_FR Subject =~ /crédit/
>>
>> The /\xE2\x82\xACuro/  and  /€uro/  are even now equivalent
>> (assuming the encoding of a *.cf file is in UTF-8, which is common).
>
> There is a slight gotcha there when writing body and header rules in
> UTF-8.  The above cases are fine, SpamAssassin just sees a sequence
> of octets (UTF-8) and compares them to a sequence of octets in a text.
>
> The gotcha is when it would be desirable to include such non-ASCII
> character in a bracketed character class, e.g. [uµùúûü]. This would
> only work if both a text and regexp from rules is represented in
> Unicode (or as some single-byte encoding), i.e. each logical character
> as an indivisible entity in a character class, not as individual octets.
> Even nastier is a range, e.g. [uµù-ü], which assumes Latin-1 encoding
> and has no sensible equivalence in Unicode.
>
> We have a couple of such cases currently in our ruleset, e.g. in
> 20_drugs.cf and 25_replace.cf.  When converting such rules into
> UTF-8, a character class like [uµùúûü] would need to be converted
> to something like (u|µ|ù|ú|û|ü).
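
To illustrate in rule terms, a hypothetical rule (not one of the actual
stock rules) written against Latin-1 text:

  body DRUG_UMLAUT_EXAMPLE  /[uµùúûü]ber/i

would, for UTF-8 text and a UTF-8 encoded *.cf file, need to become
something like:

  body DRUG_UMLAUT_EXAMPLE  /(?:u|µ|ù|ú|û|ü)ber/i

because each of those non-ASCII characters is now a two-octet sequence
and no longer fits into a single-octet character class.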


Note that this is not a new issue; there are no incompatibilities
between 3.4.0 and 3.4.1 in this regard introduced by the related
Bug 7126, Bug 7130, and Bug 7141. We had this same problem even
with 3.4.0 and earlier; it's just that normalize_charset wasn't
very popular, and UTF-8 encoding was not as widespread ten years ago
as it is now (see some statistics at Bug 7126#c6).


So to summarize: turning on normalize_charset seems like a useful
feature, as it can simplify/unify some rules (these may even be written
in a normal text editor in a UTF-8 locale, with no need for cryptic \xHH
escapes in a regexp, and no need to be concerned about different original
character sets and their encodings in a mail). Unfortunately, some of the
existing rules which assume the original encoding become ineffective
(and could even potentially misfire).
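
For example, reusing the header rule from the quote above, with
normalize_charset enabled and a *.cf file saved as UTF-8 one can simply
write

  header SUBJ_CREDIT_FR  Subject =~ /crédit/

instead of spelling out the octets as /cr\xC3\xA9dit/, and the rule then
matches regardless of whether the mail originally arrived as ISO-8859-1,
Windows-1252 or UTF-8.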

Eventually (in the future) the clean solution would be to work entirely
in the Unicode domain (perl characters). It seems we are not there yet,
due to the slowdown in regexp evaluation and due to the need to support
ancient perl versions.
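
As a rough sketch of what that could look like (an assumption about a
possible future approach, not current SpamAssassin behaviour): once the
text is decoded to perl characters and the rule regexps are treated as
characters too, the problematic character class works per character again:

  use utf8;                                  # this source file is UTF-8, regexp is characters
  use Encode qw(decode);
  my $octets = "\xC3\xB9ber";                # UTF-8 octets for "ùber"
  my $chars  = decode('UTF-8', $octets);     # perl characters, utf8 flag on
  print "hit\n" if $chars =~ /[uµù-ü]ber/;   # Latin-1 style range matches per character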
