[Bug 7314] New: Bayes.pm, DECOMPOSE_BODY_TOKENS and Unicode

bugzilla-daemon Wed, 27 Apr 2016 08:11:38 -0700

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7314


            Bug ID: 7314
           Summary: Bayes.pm, DECOMPOSE_BODY_TOKENS and Unicode
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Hardware: PC
                OS: FreeBSD
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Plugins
          Assignee: [email protected]
          Reporter: [email protected]

Created attachment 5387
  --> https://bz.apache.org/SpamAssassin/attachment.cgi?id=5387&action=edit
suggested patch

SpamAssassin fails to generate the additional Bayes tokens "Foo", "foo!" and
"foo" from the original token "Foo!" when the original token contains Unicode
characters from non-Latin scripts. This happens because \w in the regexes in
Bayes.pm matches only Latin letters, digits and the underscore. As a
consequence, almost all Cyrillic characters, for example, are deleted by
s/[^\w:\*]//gs, which leads to empty tokens or such odd results as "sk:" tokens.
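
The effect described above can be reproduced outside SpamAssassin. The
following is a minimal sketch, assuming the token reaches the substitution as
UTF-8 encoded bytes; strip_non_word is a hypothetical helper written for this
illustration, not a function from Bayes.pm.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: applies the substitution quoted in the report
# to a copy of the token and returns the result.
sub strip_non_word {
    my ($token) = @_;
    (my $stripped = $token) =~ s/[^\w:\*]//gs;
    return $stripped;
}

# Assumption: tokens are UTF-8 byte strings, not decoded character strings.
my $latin_bytes    = encode('UTF-8', "Foo!");
# "Привет!" written with escapes so the script needs no "use utf8".
my $cyrillic_bytes = encode('UTF-8', "\x{41f}\x{440}\x{438}\x{432}\x{435}\x{442}!");

my $latin_result    = strip_non_word($latin_bytes);
my $cyrillic_result = strip_non_word($cyrillic_bytes);

# On bytes, \w matches only [A-Za-z0-9_], so every byte of the
# Cyrillic word is removed along with the "!".
printf "latin:    '%s' (%d bytes)\n", $latin_result,    length $latin_result;
printf "cyrillic: '%s' (%d bytes)\n", $cyrillic_result, length $cyrillic_result;
```

The Latin token keeps its letters, while the Cyrillic token shrinks to an empty
string, which matches the empty tokens mentioned in the report.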

This problem can be corrected by the attached patch. I have little experience
with Unicode in Perl, so there may be a better solution. The main idea is to
make \w match any Unicode word character, not just Latin ones, and to replace
[A-Z] with the more generic [[:upper:]].
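
The idea can be sketched roughly as follows. This is not the attached patch:
decompose_token is a hypothetical function, and decoding the token with Encode
before matching is only one assumed way to give \w and [[:upper:]] Unicode
semantics. The /u modifier requires Perl 5.14 or later.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sketch: return the additional tokens for one token,
# e.g. "Foo", "foo!" and "foo" for "Foo!".
sub decompose_token {
    my ($token_bytes) = @_;

    # Assumption: input is UTF-8 bytes; decode so regexes see characters.
    my $token = decode('UTF-8', $token_bytes);
    my %extra;

    # Variant without punctuation: "Foo!" -> "Foo"
    if ($token =~ /[^\w:\*]/u) {
        (my $stripped = $token) =~ s/[^\w:\*]//gsu;
        $extra{$stripped} = 1;
    }

    # Lowercased variants: "Foo!" -> "foo!" and "foo"
    if ($token =~ /[[:upper:]]/u) {
        my $lowered = lc $token;
        $extra{$lowered} = 1;
        if ($lowered =~ /[^\w:\*]/u) {
            (my $stripped = $lowered) =~ s/[^\w:\*]//gsu;
            $extra{$stripped} = 1;
        }
    }

    # Encode back to UTF-8 bytes, sorted for a stable order.
    return map { encode('UTF-8', $_) } sort keys %extra;
}

my @latin    = decompose_token(encode('UTF-8', "Foo!"));
my @cyrillic = decompose_token(
    encode('UTF-8', "\x{41f}\x{440}\x{438}\x{432}\x{435}\x{442}!"));  # "Привет!"

print join(', ', @latin), "\n";
print join(', ', @cyrillic), "\n";
```

With decoded input, the Cyrillic token produces the same three kinds of
variants as the Latin one instead of collapsing to an empty string.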

It may be better to handle Unicode characters not only in the
DECOMPOSE_BODY_TOKENS section but everywhere in the _tokenize_line sub. This
idea was also mentioned in the discussion of Bug 7130. Too many regexes in the
_tokenize_line sub currently do not work properly for non-Latin Unicode
characters. For example, splitting on "..." works only for Latin words, the
regex in the IGNORE_TITLE_CASE section lowercases only the capital letters A-Z,
and so on.
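
The title-case point can be illustrated with a small comparison. Both helpers
below are hypothetical, and their patterns are assumptions made for this
sketch rather than the regexes actually used in _tokenize_line. The tokens are
assumed to be decoded character strings.

```perl
use strict;
use warnings;

# Hypothetical: lowercase the first letter only for ASCII title case.
sub lc_title_ascii {
    my ($token) = @_;
    $token = lcfirst $token if $token =~ /^[A-Z][a-z]+$/;
    return $token;
}

# Hypothetical: the same check using Unicode letter properties.
sub lc_title_unicode {
    my ($token) = @_;
    $token = lcfirst $token if $token =~ /^\p{Lu}\p{Ll}+$/;
    return $token;
}

my $latin    = "Hello";
my $cyrillic = "\x{41f}\x{440}\x{438}\x{432}\x{435}\x{442}";   # "Привет"

binmode(STDOUT, ':encoding(UTF-8)');
for my $token ($latin, $cyrillic) {
    printf "%s -> ascii: %s, unicode: %s\n",
        $token, lc_title_ascii($token), lc_title_unicode($token);
}
```

The ASCII pattern leaves the Cyrillic word untouched, while the property-based
pattern lowercases its first letter just as it does for the Latin word.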

-- 
You are receiving this mail because:
You are the assignee for the bug.
