https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7645

Henrik Krohns <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #6 from Henrik Krohns <[email protected]> ---
There are some utf8 rules, for example

(I've used "cat -v" to print them..)
body HS_BODY_899 /The seller hasnM-CM-"M-bM-^BM-,M-bM-^DM-"t provided any postage details yet/
body HS_BODY_1575 /diesem Grund folgende Zahlung zu stornieren. Um den dafM-CM-<r nM-CM-6tigen/
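
To read the "cat -v" output above, the M- and ^ notation can be turned back into raw bytes. A minimal Python sketch, not part of SpamAssassin: decode_cat_v is a hypothetical helper, and it assumes ASCII input with no literal "M-" or "^" text in the rule.

```python
def decode_cat_v(text: str) -> bytes:
    """Reverse the notation produced by `cat -v` (heuristic, ASCII input assumed)."""
    out = bytearray()
    i = 0
    n = len(text)
    while i < n:
        high = 0
        # "M-" marks a byte with the high bit set (value >= 0x80).
        if text.startswith("M-", i) and i + 2 < n:
            high = 0x80
            i += 2
        ch = text[i]
        if ch == "^" and i + 1 < n:
            nxt = text[i + 1]
            # "^?" is DEL (0x7F); "^X" is the control character X xor 0x40.
            low = 0x7F if nxt == "?" else ord(nxt) ^ 0x40
            out.append(high | low)
            i += 2
        else:
            out.append(high | ord(ch))
            i += 1
    return bytes(out)


if __name__ == "__main__":
    print(decode_cat_v("dafM-CM-<r nM-CM-6tigen").decode("utf-8"))
    print(decode_cat_v('hasnM-CM-"M-bM-^BM-,M-bM-^DM-"t').decode("utf-8"))
```

Decoded this way, HS_BODY_1575 contains plain UTF-8 ("dafür nötigen"), while the bytes in HS_BODY_899 decode to "hasnâ€™t", which looks like an apostrophe that was already double-encoded when the rule was written.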

Basically the "Wide character in print" warning comes from writing out
"scanner1.re", which ends up containing
char *Mail_SpamAssassin_CompiledRegexps_body_0_scan1(unsigned char **p){
unsigned char *q = 1 + *p;
/*!re2c
        "diesem grund folgende zahlung zu stornieren"           {RET("HS_BODY_1575,[l=1]");}
        "the seller hasnâ"            {RET("HS_BODY_899,[l=1]");}
  [\000-\377]        { return NULL; }
*/

Not sure if we should just print with binmode utf8 or similar, so the utf8
characters end up in scanner1.re, or perhaps convert them first to some hex
\xAB value. I guess this depends on what re2c is expecting.
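
To make the second option concrete, here is a rough Python sketch of the hex-escaping idea. It is not sa-compile code: re2c_escape is a hypothetical name, and it assumes re2c accepts two-digit \xHH escapes inside double-quoted strings, which should be checked against the re2c manual before relying on it.

```python
def re2c_escape(pattern: str) -> str:
    """Return a double-quoted literal containing only printable ASCII.

    The pattern is encoded to UTF-8 first, so every non-ASCII character
    becomes one \\xHH escape per byte and no wide character is ever printed.
    """
    out = []
    for byte in pattern.encode("utf-8"):
        ch = chr(byte)
        if ch in ('"', "\\"):
            # Quote and backslash would otherwise end or corrupt the literal.
            out.append("\\" + ch)
        elif 0x20 <= byte < 0x7F:
            out.append(ch)
        else:
            out.append(f"\\x{byte:02X}")
    return '"' + "".join(out) + '"'


if __name__ == "__main__":
    print(re2c_escape("um den dafür nötigen"))
```

With this approach the generated file stays pure ASCII, so the output layer (binmode or not) no longer matters; the cost is that the scanner then matches raw UTF-8 bytes, which only works if the body it is run against is also UTF-8 bytes.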

I'm not sure what state utf8 rules/checks are in anyway. If there isn't one
already, we should have some docs or a bug describing all the steps, from
reading a .cf with utf8 rules to how the rule is stored and matched against the
decoded body (which is, or is not, utf8?), and also how sa-compile fits into
all of this.

