https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7141
--- Comment #2 from Mark Martinec <[email protected]> ---
(In reply to RW from comment #1)
> Does this still have a point? I'm wondering if it was simply implemented at
> a time when tokens weren't hashed to a constant size.

That was my thought too - I don't really know.

Although looking at what gets 'skipped' (like fragments of long
e-mail addresses or URLs, long numbers, some gibberish, some really
long words), perhaps this stemming approximation does have some merit.
It may be worth finding out why it was introduced back in 2002
or thereabout.

Regardless of whether we disable it eventually or not, here is
what could be used meanwhile:

--- lib/Mail/SpamAssassin/Plugin/Bayes.pm (revision 1661153)
+++ lib/Mail/SpamAssassin/Plugin/Bayes.pm (working copy)
@@ -1240,7 +1241,14 @@
       # length, it does not help; see jm's mail to -devel on Nov 20 2002 at
       # http://sourceforge.net/p/spamassassin/mailman/message/12977605/
       # "sk:" stands for "skip".
-      $token = "sk:".substr($token, 0, 7);
+      # Bug 7141: retain seven UTF-8 chars (or other bytes),
+      # if followed by at least two bytes
+      $token =~ s{ ^ ( (?> (?: [\x00-\x7F\xF5-\xFF]      |
+                              [\xC0-\xDF][\x80-\xBF]      |
+                              [\xE0-\xEF][\x80-\xBF]{2}   |
+                              [\xF0-\xF4][\x80-\xBF]{3} | . ){7} ))
+                   .{2,} \z }{sk:$1}xs;
+      ## (was:) $token = "sk:".substr($token, 0, 7);  # seven bytes
     }
   }

As this code section is not entered often, additional complexity
does not matter much here.
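
For readers who want to experiment with the truncation rule outside of Perl, here is a rough Python port of the substitution in the patch. It is only a sketch, not SpamAssassin code: the function name skip_token and the constant _PREFIX are made up for illustration, and the token is assumed to be a byte string. The Perl atomic group (?> ... ) is emulated by matching the seven-unit prefix on its own first and then checking separately that at least two bytes follow.

```python
import re

# One "unit" is an ASCII or stray byte, a well-formed 2-, 3- or 4-byte
# UTF-8 sequence, or (as a last resort) any single byte.
# Mirrors the alternation in the patch; names here are hypothetical.
_PREFIX = re.compile(
    rb"""(?: [\x00-\x7F\xF5-\xFF]
          | [\xC0-\xDF][\x80-\xBF]
          | [\xE0-\xEF][\x80-\xBF]{2}
          | [\xF0-\xF4][\x80-\xBF]{3}
          | .
         ){7}""",
    re.VERBOSE | re.DOTALL,
)


def skip_token(token: bytes) -> bytes:
    """Return b"sk:" plus the first seven units of token, or token unchanged.

    The token is shortened only if at least two bytes remain after the
    seven-unit prefix, as in the patched Perl substitution.
    """
    m = _PREFIX.match(token)
    if m and len(token) - m.end() >= 2:
        return b"sk:" + m.group(0)
    return token


if __name__ == "__main__":
    word = "äöüäöüäöü".encode("utf-8")   # 9 characters, 18 bytes
    print(skip_token(word))               # prefix ends on a character boundary
    print(b"sk:" + word[:7])              # old 7-byte cut splits a character
```

With a plain seven-byte cut, the second print shows a prefix that ends in the middle of a two-byte character, which is the case the patch is meant to avoid.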
