[Bug 7133] Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8 will give garbage when decoding entities

bugzilla-daemon Thu, 12 Feb 2015 10:31:40 -0800

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7133


--- Comment #7 from Mark Martinec <[email protected]> ---
> body CRAZY_EURO /€uro/
> header SUBJ_CREDIT_FR Subject =~ /crédit/
>
> The /\xE2\x82\xACuro/  and  /€uro/  are even now equivalent
> (assuming the encoding of a *.cf file is in UTF-8, which is common).


There is a slight gotcha when writing body and header rules in
UTF-8.  The above cases are fine: SpamAssassin just sees a sequence
of octets (UTF-8) and compares them to a sequence of octets in the text.
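
To illustrate the octet comparison described above, here is a minimal
sketch in Python (not SpamAssassin code). It assumes both the rule file
and the message body are raw UTF-8 octets; the names rule_octets,
escaped_octets and body are made up for the example.

```python
import re

# Assumption: the *.cf file is saved as UTF-8, so the literal "€uro"
# in a rule is stored as these octets.
rule_octets = "€uro".encode("utf-8")

# The same pattern written with explicit octet escapes.
escaped_octets = b"\xe2\x82\xacuro"

# Both spellings are the same sequence of octets.
same_pattern = rule_octets == escaped_octets

# Hypothetical message body, also kept as undecoded UTF-8 octets.
body = "Win 1000 €uro today".encode("utf-8")

# A literal octet sequence matches regardless of how it was spelled.
literal_hit = re.search(re.escape(rule_octets), body) is not None
escaped_hit = re.search(re.escape(escaped_octets), body) is not None
```

As long as the pattern is a plain sequence of characters, matching
octet by octet gives the same answer as matching character by character.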

The gotcha is when it would be desirable to include such a non-ASCII
character in a bracketed character class, e.g. [uµùúûü]. This would
only work if both the text and the regexp from the rules are represented
in Unicode (or in some single-byte encoding), i.e. with each logical
character as an indivisible entity in a character class, not as
individual octets.
Even nastier is a range, e.g. [uµù-ü], which assumes Latin-1 encoding
and has no sensible equivalent when matching against UTF-8 octets.
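
The character class problem can be reproduced outside SpamAssassin.
The following is only a sketch in Python, using its re module on bytes
to stand in for octet-level matching; the variable names are
hypothetical and the sample character "é" is chosen because it is not
in the class.

```python
import re

CLASS = "[uµùúûü]"

# Octet view: the class is compiled from its UTF-8 encoding, so it
# becomes a set of single octets (u, C2, B5, C3, B9, BA, BB, BC).
octet_class = re.compile(CLASS.encode("utf-8"))

# Character view: each member of the class is one logical character.
char_class = re.compile(CLASS)

sample = "é"  # UTF-8 octets C3 A9; not a member of the class

# Octet matching wrongly hits on the lead octet C3, which "é" shares
# with "ù", "ú", "û" and "ü".
octet_hit = octet_class.search(sample.encode("utf-8"))

# Character matching correctly finds nothing.
char_hit = char_class.search(sample)

# Even for a real member such as "ü", the octet class consumes only
# one of its two octets.
partial = octet_class.search("ü".encode("utf-8")).group()
```

The octet-level class both over-matches (any character starting with
C2 or C3) and never consumes a whole two-octet character.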

We have a couple of such cases currently in our ruleset, e.g. in
20_drugs.cf and 25_replace.cf.  When converting such rules into
UTF-8, a character class like [uµùúûü] would need to be converted
to something like (u|µ|ù|ú|û|ü).
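
A rough sketch of that conversion, again in Python rather than in the
rule syntax itself. The helper class_to_alternation is hypothetical,
and it emits a non-capturing group (?:...) instead of the plain
parentheses shown above, which is a choice made for this example only.

```python
import re

def class_to_alternation(chars):
    """Turn the members of a character class into an alternation of
    their UTF-8 octet sequences, so each alternative is one whole
    character."""
    alternatives = [re.escape(c.encode("utf-8")) for c in chars]
    return b"(?:" + b"|".join(alternatives) + b")"

pattern = re.compile(class_to_alternation("uµùúûü"))

# "é" shares its lead octet with "ù".."ü" but is no longer matched.
no_hit = pattern.search("é".encode("utf-8"))

# A real member is matched as a complete two-octet sequence.
full_hit = pattern.search("über".encode("utf-8")).group()
```

Because every alternative spells out a complete octet sequence, the
pattern behaves the same whether or not the text has been decoded.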

-- 
You are receiving this mail because:
You are the assignee for the bug.
