https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7249

--- Comment #10 from [email protected] ---
> In our case the score of 1.5 seems to work fine. The hit rate might
> be higher in countries using multibyte character sets, depending
> on how poorly mail clients there (and bulk mail generating software)
> implement RFC 2047.

I have tested L_SPLIT_UTF8_SUBJ and L_SPLIT_UTF8_FROM on my corpus, which consists
mostly of messages with Cyrillic characters (34460 spam messages and 20354 ham).
~90% of the messages have MIME-encoded subjects and ~50% have a MIME-encoded From
header, but most of those use the windows-1251 or koi8-r encoding; fewer than 5%
of the messages use UTF-8.

L_SPLIT_UTF8_SUBJ got 3 hits in ham and 50 hits in spam, which is a good
result.
L_SPLIT_UTF8_FROM got 46 hits in ham and 19 hits in spam, which is too many false
positives. The false positives came from badly written PHP mail robots and
bulk mail software sending ham messages.
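
To put the counts above in proportion, here is a small sketch that turns them
into per-corpus hit rates and into the share of each rule's hits that are spam.
The corpus sizes and hit counts are the ones reported above; the names
(`hits`, `summarize`) are only illustrative.

```python
# Corpus sizes and hit counts as reported in this comment.
SPAM_TOTAL = 34460
HAM_TOTAL = 20354

hits = {
    "L_SPLIT_UTF8_SUBJ": {"ham": 3, "spam": 50},
    "L_SPLIT_UTF8_FROM": {"ham": 46, "spam": 19},
}


def summarize(rule):
    """Return (spam hit rate, ham hit rate, share of hits that are spam)."""
    h = hits[rule]
    spam_rate = h["spam"] / SPAM_TOTAL
    ham_rate = h["ham"] / HAM_TOTAL
    spam_share = h["spam"] / (h["spam"] + h["ham"])
    return spam_rate, ham_rate, spam_share


for rule in hits:
    spam_rate, ham_rate, spam_share = summarize(rule)
    print(
        f"{rule}: spam {spam_rate:.3%}, ham {ham_rate:.3%}, "
        f"spam share of hits {spam_share:.1%}"
    )
```

On these numbers roughly 94% of the L_SPLIT_UTF8_SUBJ hits are spam (50 of 53),
while only about 29% of the L_SPLIT_UTF8_FROM hits are (19 of 65).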

So L_SPLIT_UTF8_SUBJ seems to be a good addition for catching Cyrillic spam, at
least for my mail flow.
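
For readers who want to try this on their own corpus outside SpamAssassin,
below is a rough sketch of the kind of check involved. It assumes, going only
by the rule names and the RFC 2047 remark quoted above, that the rules look for
a UTF-8 multibyte character split between adjacent encoded-words; the actual
rule definitions are not reproduced here and may differ. The function name
`has_split_utf8` is hypothetical. The sketch flags any UTF-8 encoded-word whose
payload is not valid UTF-8 on its own, so it will also flag plainly corrupt
payloads, not only split characters.

```python
import base64
import binascii
import quopri
import re

# RFC 2047 encoded-word: =?charset?encoding?text?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def word_bytes(encoding, text):
    """Return the raw bytes carried by a single encoded-word."""
    if encoding.lower() == "b":
        return base64.b64decode(text)
    # Q encoding; header=True makes "_" decode to a space.
    return quopri.decodestring(text.encode("ascii"), header=True)


def has_split_utf8(header_value):
    """True if some UTF-8 encoded-word does not decode as UTF-8 by itself."""
    for charset, encoding, text in ENCODED_WORD.findall(header_value):
        if charset.lower() not in ("utf-8", "utf8"):
            continue  # windows-1251, koi8-r, etc. are out of scope here
        try:
            word_bytes(encoding, text).decode("utf-8")
        except UnicodeDecodeError:
            return True  # payload starts or ends in the middle of a character
        except (binascii.Error, ValueError):
            continue  # malformed encoded-word; not what this check is about
    return False
```

Running it over the raw Subject and From header values of a ham and a spam
corpus gives counts that can be compared in the same way as the figures above.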
