https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7249

--- Comment #9 from Mark Martinec <[email protected]> ---
Btw (for reference), Bug 7307 is related.

(In reply to John Wilcock from comment #8)
> (In reply to azotov from comment #4)
> > This example was taken from real spam message which was created by some
> > non-rfc-compliant software.
>
> Is there a mechanism in place for SA to not only work around such RFC
> violations but also flag the fact that it has done so, because the violation
> might be a spam sign worth scoring?
>
> If not, is it worth raising a new bug to record the idea?

I'm using the following two rules to check for such violations:

# RFC 2047 section 5:
#   Each 'encoded-word' MUST represent an integral number of characters.
#   A multi-octet character may not be split across adjacent 'encoded-word's.

header   L_SPLIT_UTF8_SUBJ  Subject:raw =~ m{(=\?UTF-8) (?: \* [^?=<>, \t]* )? (\?Q\?) [^ ?]* =[89A-F][0-9A-F] \?= \s* \1 (?: \* [^ ?=]* )? \2 =[89AB][0-9A-F]}xsmi
describe L_SPLIT_UTF8_SUBJ  UTF-8 char split across QP encoded-words in Subject
score    L_SPLIT_UTF8_SUBJ  1.5

header   L_SPLIT_UTF8_FROM  From:raw =~ m{(=\?UTF-8) (?: \* [^?=<>, \t]* )? (\?Q\?) [^ ?]* =[89A-F][0-9A-F] \?= \s* \1 (?: \* [^ ?=]* )? \2 =[89AB][0-9A-F]}xsmi
describe L_SPLIT_UTF8_FROM  UTF-8 char split across QP encoded-words in From
score    L_SPLIT_UTF8_FROM  1.5

The L_SPLIT_UTF8_FROM hit only 4 times in the last three weeks (of 5 million
messages processed at my site during that time), all in spam which already
scored pretty high by other rules.

The L_SPLIT_UTF8_SUBJ hit 62 times, almost all of which was spam.
In our case the score of 1.5 seems to work fine.

The hit rate might be higher in countries using multibyte character sets,
depending on how poorly mail clients there (and bulk mail generating
software) implement RFC 2047.
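
For anyone who wants to try the pattern outside of SpamAssassin, here is a minimal Python sketch of the same check. It is only an illustration, not how SA evaluates the rule: the function name has_split_utf8_char and the sample header values are made up for this example, and the regex is a direct port of the one in the rules above, with Python's re flags standing in for /xsmi. The idea is that the first encoded-word ends with a non-ASCII octet (=80 to =FF) and the next one starts with a UTF-8 continuation octet (=80 to =BF), so a multi-octet character must have been cut in two.

```python
import re

# Port of the rule's regex; VERBOSE/DOTALL/MULTILINE/IGNORECASE mirror /xsmi.
SPLIT_UTF8_QP = re.compile(
    r"""
    (=\?UTF-8) (?: \* [^?=<>, \t]* )?   # charset, optional "*language" suffix
    (\?Q\?)                             # Q encoding marker
    [^ ?]* =[89A-F][0-9A-F] \?=         # first word ends with an octet >= 0x80
    \s*                                 # whitespace or folding between words
    \1 (?: \* [^ ?=]* )? \2             # next word: same charset and encoding
    =[89AB][0-9A-F]                     # it starts with a continuation octet
    """,
    re.VERBOSE | re.DOTALL | re.MULTILINE | re.IGNORECASE,
)


def has_split_utf8_char(raw_header: str) -> bool:
    """Return True if a UTF-8 character appears to be split across two
    adjacent Q-encoded words in the raw (undecoded) header value."""
    return SPLIT_UTF8_QP.search(raw_header) is not None


if __name__ == "__main__":
    # Hypothetical sample values: "é" is the octet pair C3 A9 in UTF-8.
    split_subject = "=?UTF-8?Q?caf=C3?= =?UTF-8?Q?=A9_au_lait?="
    clean_subject = "=?UTF-8?Q?caf=C3=A9?= =?UTF-8?Q?_au_lait?="
    print(has_split_utf8_char(split_subject))
    print(has_split_utf8_char(clean_subject))
```

Like the rules themselves, this only looks at Q-encoded words; a character split across Base64 encoded-words would not be noticed by this pattern.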
