https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7135

            Bug ID: 7135
           Summary: Bayes tokenizer 'arbitrarily' breaks multibyte CJK
                    utf-8 characters into digrams instead of breaking on
                    UTF-8 character boundaries
           Product: Spamassassin
           Version: 3.4.0
          Hardware: All
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Plugins
          Assignee: [email protected]
          Reporter: [email protected]

The 'bayes: token' debug logging for mail messages written in
Far Eastern character sets often shows a multitude of entries
like:

  bayes: token '8:i�' => 0.00795...

The code section in Bayes.pm that does this is:

  if (TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES && $token =~ /[\xa0-\xff]{2}/) {
    # Matt sez: "Could be asian? Autrijus suggested doing character ngrams,
    # but I'm doing tuples to keep the dbs small(er)."  Sounds like a plan
    # to me! (jm)
    while ($token =~ s/^(..?)//) {
      push (@rettokens, "8:$1");
    }
    next;
  }

So it seems that 3- or 4-byte UTF-8 sequences representing
characters like CJK or special punctuation are just 'arbitrarily'
chopped into pairs of octets, regardless of the boundaries between
characters. For example, the last octet of one character can form
a pair with the first octet of the next character, or an arbitrary
pair of adjacent octets (a substring) of the 3- or 4-byte UTF-8
encoding of a single character is treated as a token.
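
To illustrate, here is a minimal stand-alone sketch of that loop. The
helper name chop_into_pairs and the sample string are made up for
this example and are not part of Bayes.pm; the sample is assumed to
be two CJK characters (U+65E5, U+672C) given as raw UTF-8 octets.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the existing pair-chopping loop.
sub chop_into_pairs {
    my ($token) = @_;
    my @pairs;
    while ($token =~ s/^(..?)//) {
        push @pairs, $1;
    }
    return @pairs;
}

# Render a byte string as space-separated hex octets.
sub hex_octets {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# U+65E5 is E6 97 A5 and U+672C is E6 9C AC in UTF-8.
my $token = "\xE6\x97\xA5\xE6\x9C\xAC";

print '8:', hex_octets($_), "\n" for chop_into_pairs($token);
```

The middle pair (A5 E6) joins the last octet of the first character
with the first octet of the second one, so none of the three tokens
corresponds to a whole character.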

This seems far from ideal. It's like taking pairs of bytes from
a Base64-encoded message and hoping to get a good representation
of the original message.

So I suggest adding the following code section just before
the code section mentioned above:

  if (TOKENIZE_LONG_8BIT_SEQS_AS_UTF8_CHARS && $token =~ /[\x80-\xBF]{2}/) {
    # only collect 3- and 4-byte UTF-8 sequences, ignore 2-byte sequences
    my(@t) = $token =~ /( (?: [\xE0-\xEF] | [\xF0-\xF4][\x80-\xBF] )
                          [\x80-\xBF]{2} )/xsg;
    if (@t) {
      push (@rettokens, map('u8:'.$_, @t));
      next;
    }
  }

It only collects valid 3- or 4-octet UTF-8 characters from long
tokens containing 8-bit characters, very much like the original
code section does, but it observes character boundaries.
This covers characters from CJK character sets, punctuation
characters, the Euro symbol, etc., but does not trigger on Western
character sets, which are mostly represented as 2-byte UTF-8
sequences.
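
As a rough check of the regex in isolation, the sketch below wraps it
in a helper and runs it on a sample. The names utf8_long_chars and
hex_octets and the sample string are assumptions made for this
example only; the sample is one 2-byte character (U+00E9) followed
by three 3-byte characters (U+65E5, U+672C, U+20AC), all as raw
UTF-8 octets.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the proposed regex; not part of Bayes.pm.
sub utf8_long_chars {
    my ($token) = @_;
    my @t = $token =~ /( (?: [\xE0-\xEF] | [\xF0-\xF4][\x80-\xBF] )
                         [\x80-\xBF]{2} )/xsg;
    return @t;
}

# Render a byte string as space-separated hex octets.
sub hex_octets {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# C3 A9 | E6 97 A5 | E6 9C AC | E2 82 AC
my $sample = "\xC3\xA9\xE6\x97\xA5\xE6\x9C\xAC\xE2\x82\xAC";

print 'u8:', hex_octets($_), "\n" for utf8_long_chars($sample);
```

Only the three 3-byte sequences are returned, each as one whole
character; the 2-byte sequence C3 A9 is skipped.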

If no valid long UTF-8 byte sequences are found, it falls
back to the existing code, which just chops the string into byte pairs.
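
The combined control flow could be sketched roughly as follows. This
is a stand-alone approximation, not the actual Bayes.pm code: the
function name tokenize_8bit is hypothetical, both constants are
assumed to be enabled, and the surrounding per-token loop (hence
the 'next' statements) is replaced by early returns.

```perl
use strict;
use warnings;

# Both switches are assumed to be on for this sketch.
use constant TOKENIZE_LONG_8BIT_SEQS_AS_UTF8_CHARS => 1;
use constant TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES     => 1;

# Hypothetical stand-alone function; not part of Bayes.pm.
sub tokenize_8bit {
    my ($token) = @_;

    # Proposed section: prefer whole 3- and 4-byte UTF-8 characters.
    if (TOKENIZE_LONG_8BIT_SEQS_AS_UTF8_CHARS && $token =~ /[\x80-\xBF]{2}/) {
        my @t = $token =~ /( (?: [\xE0-\xEF] | [\xF0-\xF4][\x80-\xBF] )
                             [\x80-\xBF]{2} )/xsg;
        return map('u8:'.$_, @t) if @t;
    }

    # Existing section: fall back to chopping into byte pairs.
    my @rettokens;
    if (TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES && $token =~ /[\xa0-\xff]{2}/) {
        while ($token =~ s/^(..?)//) {
            push @rettokens, "8:$1";
        }
    }
    return @rettokens;
}

# UTF-8 input yields u8: tokens; 8-bit input that is not UTF-8
# (here four octets that cannot form a UTF-8 sequence) yields 8: pairs.
my @utf8_tokens   = tokenize_8bit("\xE6\x97\xA5\xE6\x9C\xAC");
my @legacy_tokens = tokenize_8bit("\xE4\xF6\xFC\xDF");
printf "%d u8 tokens, %d byte-pair tokens\n",
    scalar(@utf8_tokens), scalar(@legacy_tokens);
```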
