[Bug 7141] New: Bayes truncates ('skip') long tokens on bytes, should it count characters instead?

bugzilla-daemon Thu, 19 Feb 2015 10:58:42 -0800

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7141


            Bug ID: 7141
           Summary: Bayes truncates ('skip') long tokens on bytes, should
                    it count characters instead?
           Product: Spamassassin
           Version: 3.4.0
          Hardware: All
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: Plugins
          Assignee: [email protected]
          Reporter: [email protected]

In the reported Bayes debug output, some of the truncated ('skip')
tokens stand out, e.g.:

  dbg: bayes: token 'sk:Раз�' => 0.986543689320388
  dbg: bayes: token 'sk:Бес�' => 0.986543689320388
  dbg: bayes: token 'sk:сбо�' => 0.986543689320388
  dbg: bayes: token 'sk:уча�' => 0.993172413793104
  dbg: bayes: token 'sk:Ува�' => 0.986543689320388
  dbg: bayes: token 'sk:сог�' => 0.0156699029126214

Truncation to 7 bytes seems unnatural for UTF-8 text in scripts whose
characters are mostly encoded as 2-byte sequences (this goes for
alphabets like Cyrillic, Greek, and Latin with diacritics, but not for
CJK or US-ASCII). The last character is chopped in the middle of its
2-byte sequence.
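
The effect can be reproduced outside SpamAssassin. The following is a
minimal sketch, assuming the token reaches substr() as raw UTF-8 octets;
the sample word is made up for illustration and is not taken from the
debug output above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample token: the 12-letter Cyrillic word "Разработками",
# held as raw UTF-8 octets (2 bytes per letter).
my $token = encode('UTF-8',
    "\x{420}\x{430}\x{437}\x{440}\x{430}\x{431}"
  . "\x{43E}\x{442}\x{43A}\x{430}\x{43C}\x{438}");

# On an octet string substr() counts bytes, so 7 means 7 bytes.
my $skip = substr($token, 0, 7);

# Three complete 2-byte characters, then the lead byte of a fourth.
my $hex = join ' ', unpack('(H2)*', $skip);

# utf8::decode() returns false when the octets are not well-formed UTF-8;
# work on a copy so $skip itself stays untouched.
my $copy  = $skip;
my $valid = utf8::decode($copy) ? 1 : 0;

print "token length in bytes: ", length($token), "\n";  # 24
print "truncated octets:      $hex\n";                  # d0 a0 d0 b0 d0 b7 d1
print "well-formed UTF-8:     $valid\n";                # 0
```

The dangling d1 byte is what shows up as a replacement character at the
end of each token in the debug lines.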

Should we increase the truncation length by one, to 8 bytes? That would
at least preserve most of the 2-byte characters in such alphabets.

Or maybe the limit of 7 should be interpreted as 7 *characters*
(not bytes).
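
To compare the two options, here is a rough sketch of the
character-counting variant. skip_token_chars() is a hypothetical helper,
not existing SpamAssassin code, and it assumes tokens arrive as UTF-8
octets; octets that do not decode cleanly fall back to byte counting.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not SpamAssassin code): build a skip token from the
# first $n characters of a token held as UTF-8 octets.
sub skip_token_chars {
    my ($octets, $n) = @_;
    # Strict decode; LEAVE_SRC keeps $octets intact for the fallback below.
    my $chars = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    # Not valid UTF-8 (e.g. a legacy 8-bit charset): count bytes instead.
    return 'sk:' . substr($octets, 0, $n) unless defined $chars;
    my $head = substr($chars, 0, $n);
    return 'sk:' . encode('UTF-8', $head);
}

# Hypothetical sample tokens: "Разработками", with and without an ASCII prefix.
my $cyr = encode('UTF-8',
    "\x{420}\x{430}\x{437}\x{440}\x{430}\x{431}"
  . "\x{43E}\x{442}\x{43A}\x{430}\x{43C}\x{438}");
my $mixed = 'x' . $cyr;

# Counting characters: 7 characters are kept whatever their byte width.
my $kept_cyr   = length(skip_token_chars($cyr,   7)) - 3;   # 14 bytes
my $kept_mixed = length(skip_token_chars($mixed, 7)) - 3;   # 13 bytes

# Counting 8 bytes: fine for $cyr (4 whole characters), but the ASCII
# prefix in $mixed shifts the alignment and the cut ends on a lead byte.
my $tail_cyr   = unpack('H2', substr($cyr,   7, 1));        # 80, a continuation byte
my $tail_mixed = unpack('H2', substr($mixed, 7, 1));        # d1, a dangling lead byte

print "$kept_cyr $kept_mixed $tail_cyr $tail_mixed\n";
```

The 8-byte limit helps only while every character in the kept prefix is
2 bytes wide; a single ASCII or 3-byte character brings the split back.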



Here is the current relevant code section from Plugin/Bayes.pm:


# How long a token should we hold onto?  (note: German speakers
# typically will require a longer token than English ones.)
use constant MAX_TOKEN_LENGTH => 15;

[...]

  if (($region == 0 && HDRS_TOKENIZE_LONG_TOKENS_AS_SKIPS)
      || ($region == 1 && BODY_TOKENIZE_LONG_TOKENS_AS_SKIPS)
      || ($region == 2 && URIS_TOKENIZE_LONG_TOKENS_AS_SKIPS))
  {
    # if (TOKENIZE_LONG_TOKENS_AS_SKIPS)
    # Spambayes trick via Matt: Just retain 7 chars.  Do not retain the
    # length, it does not help; see jm's mail to -devel on Nov 20 2002 at
    # http://sourceforge.net/p/spamassassin/mailman/message/12977605/
    # "sk:" stands for "skip".
    $token = "sk:".substr($token, 0, 7);
  }
}
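
A third possibility, sketched here only as an illustration, is to keep a
byte limit but back off to the previous character boundary, so a skip
token never ends inside a multi-byte sequence. truncate_on_boundary() is
a hypothetical helper, not part of Plugin/Bayes.pm, and it assumes the
token is an octet string in UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper (not SpamAssassin code): cut an octet string to at
# most $max bytes, then drop a trailing incomplete UTF-8 sequence.
sub truncate_on_boundary {
    my ($octets, $max) = @_;
    return $octets if length($octets) <= $max;
    my $cut = substr($octets, 0, $max);
    # A sequence is incomplete when a lead byte is followed by fewer
    # continuation bytes than it announces: none for a 2-byte lead,
    # at most one for a 3-byte lead, at most two for a 4-byte lead.
    $cut =~ s/(?:[\xC0-\xDF]
               |[\xE0-\xEF][\x80-\xBF]{0,1}
               |[\xF0-\xF7][\x80-\xBF]{0,2})\z//x;
    return $cut;
}

# Hypothetical sample: first letters of "Разработками" as raw octets.
my $cyr   = "\xD0\xA0\xD0\xB0\xD0\xB7\xD1\x80\xD0\xB0\xD0\xB1";
my $mixed = 'x' . $cyr;

my $seven = 'sk:' . truncate_on_boundary($cyr,   7);   # keeps 6 bytes, 3 characters
my $eight = 'sk:' . truncate_on_boundary($mixed, 8);   # keeps 7 bytes, 'x' + 3 characters

print length($seven) - 3, " ", length($eight) - 3, "\n";   # 6 7
```

The trade-off is that tokens are shortened slightly more than the
nominal limit, which may merge a few more distinct words into the same
skip token.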

-- 
You are receiving this mail because:
You are the assignee for the bug.
