https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7141
Bug ID: 7141
Summary: Bayes truncates ('skip') long tokens on bytes, should
it count characters instead?
Product: Spamassassin
Version: 3.4.0
Hardware: All
OS: All
Status: NEW
Severity: enhancement
Priority: P5
Component: Plugins
Assignee: [email protected]
Reporter: [email protected]
Looking at the Bayes tokens reported in debug output, some of the
truncated ('skip') tokens stand out, e.g.:
dbg: bayes: token 'sk:Раз�' => 0.986543689320388
dbg: bayes: token 'sk:Бес�' => 0.986543689320388
dbg: bayes: token 'sk:сбо�' => 0.986543689320388
dbg: bayes: token 'sk:уча�' => 0.993172413793104
dbg: bayes: token 'sk:Ува�' => 0.986543689320388
dbg: bayes: token 'sk:сог�' => 0.0156699029126214
Truncating to 7 bytes seems unnatural for UTF-8 text in scripts whose
characters are mostly encoded as 2-byte sequences (Cyrillic, Greek,
Latin letters with diacritics, but not CJK or US-ASCII). The last
character is chopped in the middle of its 2-byte sequence.
Should we increase the truncation length by one, to 8 bytes? That would
at least keep most of the 2-byte characters in such alphabets intact.
Or maybe the limit of 7 should be interpreted as 7 *characters*
(not bytes).
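To make the two byte limits concrete, here is a minimal sketch, not taken from SpamAssassin. It assumes the token is held as raw UTF-8 octets, and the helper name is_valid_utf8 is hypothetical. It checks whether a cut at 7 or 8 bytes leaves well-formed UTF-8, and also shows that an 8-byte cut can still split a character when a single ASCII byte precedes the 2-byte letters.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SpamAssassin): true if $octets is
# well-formed UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $copy = $octets;              # utf8::decode works in place, so use a copy
    return utf8::decode($copy) ? 1 : 0;
}

# Ten Cyrillic letters as UTF-8 octets, two bytes each (20 bytes in total).
my $token = "\xD0\xA0\xD0\xB0\xD0\xB7\xD1\x80\xD0\xB0"
          . "\xD0\xB1\xD0\xBE\xD1\x82\xD0\xBA\xD0\xB0";

my $cut7   = substr($token, 0, 7);        # three letters plus a dangling lead byte
my $cut8   = substr($token, 0, 8);        # four complete letters
my $mixed8 = substr("a" . $token, 0, 8);  # 'a', three letters, dangling lead byte

printf "7 bytes well-formed:            %s\n", is_valid_utf8($cut7)   ? "yes" : "no";
printf "8 bytes well-formed:            %s\n", is_valid_utf8($cut8)   ? "yes" : "no";
printf "8 bytes, ASCII prefix, well-formed: %s\n", is_valid_utf8($mixed8) ? "yes" : "no";
```

An even byte limit only helps when every character before the cut is two bytes wide, which is why the report says "most of" rather than "all".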
Here is the current relevant code section from Plugin/Bayes.pm:
# How long a token should we hold onto? (note: German speakers
# typically will require a longer token than English ones.)
use constant MAX_TOKEN_LENGTH => 15;
[...]
  if (($region == 0 && HDRS_TOKENIZE_LONG_TOKENS_AS_SKIPS)
      || ($region == 1 && BODY_TOKENIZE_LONG_TOKENS_AS_SKIPS)
      || ($region == 2 && URIS_TOKENIZE_LONG_TOKENS_AS_SKIPS))
  {
    # if (TOKENIZE_LONG_TOKENS_AS_SKIPS)
    # Spambayes trick via Matt: Just retain 7 chars. Do not retain the
    # length, it does not help; see jm's mail to -devel on Nov 20 2002 at
    # http://sourceforge.net/p/spamassassin/mailman/message/12977605/
    # "sk:" stands for "skip".
    $token = "sk:".substr($token, 0, 7);
  }
}
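For comparison, the "7 characters" reading could look roughly like the sketch below. This is an illustration only, not a proposed patch. It assumes $token arrives as UTF-8 octets, and the helper names skip_token_bytes and skip_token_chars are hypothetical. The second helper decodes first, so substr() counts characters, then re-encodes the result.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: what the quoted line does when $octets is a byte string.
sub skip_token_bytes {
    my ($octets, $n) = @_;
    return "sk:" . substr($octets, 0, $n);
}

# Hypothetical helper: count characters instead of bytes.
# Assumption: input is UTF-8; with the default (lax) decoding, malformed
# input is replaced by U+FFFD rather than raising an error.
sub skip_token_chars {
    my ($octets, $n) = @_;
    my $chars = decode('UTF-8', $octets);
    return "sk:" . encode('UTF-8', substr($chars, 0, $n));
}

# Ten Cyrillic letters as UTF-8 octets, two bytes each (20 bytes in total).
my $token = "\xD0\xA0\xD0\xB0\xD0\xB7\xD1\x80\xD0\xB0"
          . "\xD0\xB1\xD0\xBE\xD1\x82\xD0\xBA\xD0\xB0";

my $by_bytes = skip_token_bytes($token, 7);
my $by_chars = skip_token_chars($token, 7);

printf "byte-based: %d octets after 'sk:'\n", length($by_bytes) - 3;
printf "char-based: %d octets after 'sk:'\n", length($by_chars) - 3;
```

With this input the byte-based token keeps three letters plus half of a fourth, while the character-based token keeps seven whole letters in 14 octets. Stored tokens therefore become longer for non-ASCII text.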