https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7133
--- Comment #7 from Mark Martinec <[email protected]> ---
> body CRAZY_EURO /€uro/
> header SUBJ_CREDIT_FR Subject =~ /crédit/
>
> The /\xE2\x82\xACuro/ and /€uro/ are even now equivalent
> (assuming the encoding of a *.cf file is in UTF-8, which is common).

There is a slight gotcha when writing body and header rules in UTF-8.

The above cases are fine: SpamAssassin just sees a sequence of octets
(UTF-8) in the rule and compares it to a sequence of octets in the text.

The gotcha arises when it would be desirable to include such a non-ASCII
character in a bracketed character class, e.g. [uµùúûü]. This only works
if both the text and the regexp from the rules are represented in Unicode
(or in some single-byte encoding), i.e. with each logical character
treated as an indivisible entity in the character class, not as
individual octets.

Even nastier is a range, e.g. [uµù-ü], which assumes Latin-1 encoding
and has no sensible equivalent in Unicode.

We currently have a couple of such cases in our ruleset, e.g. in
20_drugs.cf and 25_replace.cf. When converting such rules to UTF-8,
a character class like [uµùúûü] would need to be converted to something
like (u|µ|ù|ú|û|ü).
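
To make the octet-versus-character distinction concrete, here is a minimal sketch in Python. It is not SpamAssassin code: Python's `re` module on `bytes` is only assumed to be a reasonable stand-in for matching a UTF-8 rule against UTF-8 text octet by octet, and the `dr...gs` pattern and all helper names are made up for illustration.

```python
import re

# The characters from the example class [uµùúûü].
VARIANTS = "uµùúûü"


def octets(text: str) -> bytes:
    """Encode text as UTF-8, i.e. the octets an octet-level matcher sees."""
    return text.encode("utf-8")


# Bracketed class over UTF-8 octets: every octet of every multi-byte
# character becomes a separate class member, so the class matches exactly
# one octet, never a whole two-octet character such as "ü" (C3 BC).
byte_class = re.compile(b"dr[" + octets(VARIANTS) + b"]gs")

# Alternation over UTF-8 octets: each alternative is the complete octet
# sequence of one character, so multi-byte characters match as a unit.
alternation = re.compile(
    b"dr(?:" + b"|".join(octets(c) for c in VARIANTS) + b")gs"
)

# The same class with both pattern and subject as Unicode strings:
# each character is an indivisible class member.
char_class = re.compile("dr[" + VARIANTS + "]gs")

# A Latin-1 style class with a range (ù-ü is F9-FC in Latin-1).
latin1_range = re.compile(b"dr[u\xb5\xf9-\xfc]gs")


def matches(pattern, subject) -> bool:
    """Return True if the pattern is found anywhere in the subject."""
    return pattern.search(subject) is not None
```

With these definitions, `matches(byte_class, octets("drügs"))` is false while `matches(alternation, octets("drügs"))` and `matches(char_class, "drügs")` are both true. The octet-level class also accepts the stray single octet in `b"dr\xc3gs"`, which is not valid UTF-8 at all. The Latin-1 range pattern matches `"drügs".encode("latin-1")` but not the UTF-8 encoding of the same word.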
