https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8271

            Bug ID: 8271
           Summary: bayes_stopword_ru in 60_bayes_stopwords.cf misses some
                    very common russian stopwords
           Product: Spamassassin
           Version: 4.0.1
          Hardware: PC
                OS: FreeBSD
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Rules
          Assignee: dev@spamassassin.apache.org
          Reporter: azo...@geolink-group.com
  Target Milestone: Undefined

The list of Russian stopwords compiled into bayes_stopword_ru seems to be very
long, but it misses some very common Russian stopwords.

How to reproduce:

cat <<EOT >> stopwords.mbox
From test  Mon Jul 22 20:42:19 2024
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

=D0=B5=D1=81=D0=BB=D0=B8 =D0=B8=D0=BB=D0=B8 =D0=BD=D0=B5
EOT

spamassassin -L -D bayes --mbox stopwords.mbox

Only the first stopword, "если" ("if" in English), is found:
dbg: bayes: skipped token '\x{D0}\x{B5}\x{D1}\x{81}\x{D0}\x{BB}\x{D0}\x{B8}'
because it's in stopword list for language 'ru'

The stopwords \x{D0}\x{B8}\x{D0}\x{BB}\x{D0}\x{B8} ("или", "or" in English)
and \x{D0}\x{BD}\x{D0}\x{B5} ("не", "not" in English) are missed.
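
To make the mapping between the test message and the debug output easier to
follow, here is a minimal Python sketch (not part of SpamAssassin) that decodes
the quoted-printable body line above and prints each word next to its UTF-8
bytes in the \x{..} style used by the debug log. The helper name debug_escape
is hypothetical and exists only for this illustration.

```python
import quopri

# Body line copied from the test message in this report.
QP_BODY = b"=D0=B5=D1=81=D0=BB=D0=B8 =D0=B8=D0=BB=D0=B8 =D0=BD=D0=B5"


def debug_escape(word: str) -> str:
    """Render a word as UTF-8 bytes in the \\x{..} style seen in the debug log."""
    return "".join(f"\\x{{{b:02X}}}" for b in word.encode("utf-8"))


# Decode quoted-printable to raw bytes, then interpret them as UTF-8 text.
words = quopri.decodestring(QP_BODY).decode("utf-8").split()

for w in words:
    print(w, debug_escape(w))
```

Only the first of the three printed byte sequences appears in a "skipped
token" line of the debug output quoted above.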

The full list of missing Russian stopwords is "на|по|не|от|для|или|за|из|что".
It would be great to add these stopwords to the bayes_stopword_ru rule in
60_bayes_stopwords.cf.


P.S. The current version of bayes_stopword_ru in 60_bayes_stopwords.cf also
contains some low-frequency Russian words that cannot be considered stopwords.
For example, the word "иногда" ("sometimes" in English) was found in only
7 of 8000 spam messages and 50 of 40000 ham messages in my own hand-selected
corpus.
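
As a quick check of how rare that is, the counts above can be turned into
per-message rates. This is only a sketch of the arithmetic; the variable names
are mine, and the combined rate assumes the spam and ham sets do not overlap.

```python
# Counts reported above for the word "иногда".
spam_hits, spam_total = 7, 8000
ham_hits, ham_total = 50, 40000

spam_rate = spam_hits / spam_total
ham_rate = ham_hits / ham_total
# Combined rate over the whole corpus (assumes spam and ham are disjoint sets).
overall_rate = (spam_hits + ham_hits) / (spam_total + ham_total)

print(f"spam: {spam_rate:.4%}  ham: {ham_rate:.4%}  overall: {overall_rate:.4%}")
```

All three rates are well below 1% of messages.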
Personally, I use the following custom list of stopwords in my SpamAssassin
installation:
bayes_stopword_ru
(?^:(на|по|не|от|для|или|за|Вас|из|что|если|будет|Вам|Если|мы|Здравствуйте|есть|это|можно|только|вас|нужно|без|его))
Each of these words had a Bayes score between 0.4 and 0.6 (for a Bayes database
trained on my corpus with no stopwords) and was found in at least 10% of the
messages in my corpus.
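
The selection rule described above could be sketched roughly as follows. This
is an illustration in Python, not the script I actually used and not anything
provided by SpamAssassin: the TokenStats type, the function name, and the
sample numbers are all hypothetical, and treating the 0.4 and 0.6 bounds as
inclusive is an assumption.

```python
from typing import Dict, List, NamedTuple


class TokenStats(NamedTuple):
    probability: float  # Bayes probability of the token (0.0 = ham, 1.0 = spam)
    messages: int       # number of corpus messages that contain the token


def neutral_frequent_tokens(
    stats: Dict[str, TokenStats],
    total_messages: int,
    low: float = 0.4,
    high: float = 0.6,
    min_share: float = 0.10,
) -> List[str]:
    """Return tokens that are both near-neutral and present in many messages."""
    return sorted(
        token
        for token, s in stats.items()
        if low <= s.probability <= high
        and s.messages / total_messages >= min_share
    )


# Made-up statistics for illustration only; they are not measurements.
sample = {
    "не": TokenStats(probability=0.52, messages=21000),       # neutral, frequent
    "иногда": TokenStats(probability=0.45, messages=57),      # neutral, but rare
    "бесплатно": TokenStats(probability=0.93, messages=6000),  # frequent, but spammy
}

candidates = neutral_frequent_tokens(sample, total_messages=48000)
print(candidates)
```

With this made-up sample, only "не" passes both filters: "иногда" is too rare
and "бесплатно" is too strongly associated with spam.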

