https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8271
Bug ID: 8271
Summary: bayes_stopword_ru in 60_bayes_stopwords.cf misses some very common russian stopwords
Product: Spamassassin
Version: 4.0.1
Hardware: PC
OS: FreeBSD
Status: NEW
Severity: enhancement
Priority: P2
Component: Rules
Assignee: dev@spamassassin.apache.org
Reporter: azo...@geolink-group.com
Target Milestone: Undefined

The list of Russian stopwords compiled into bayes_stopword_ru is very long, but it is missing some very common Russian stopwords.

How to reproduce:

cat <<EOT >> stopwords.mbox
From test Mon Jul 22 20:42:19 2024
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

=D0=B5=D1=81=D0=BB=D0=B8 =D0=B8=D0=BB=D0=B8 =D0=BD=D0=B5
EOT
spamassassin -L -D bayes --mbox stopwords.mbox

Only the first stopword, "если", is found:

dbg: bayes: skipped token '\x{D0}\x{B5}\x{D1}\x{81}\x{D0}\x{BB}\x{D0}\x{B8}' because it's in stopword list for language 'ru'

The stopwords \x{D0}\x{B8}\x{D0}\x{BB}\x{D0}\x{B8} ("или", meaning "or" in English) and \x{D0}\x{BD}\x{D0}\x{B5} ("не", meaning "not" in English) are missed.

The full list of missing Russian stopwords is "на|по|не|от|для|или|за|из|что". It would be great to adjust 60_bayes_stopwords.cf by adding these stopwords to the bayes_stopword_ru rule.

P.S. The current version of bayes_stopword_ru in 60_bayes_stopwords.cf also contains some low-frequency Russian words that cannot be considered stopwords. For example, the word "иногда" ("sometimes" in English) was found in only 7 of 8000 spam messages and in 50 of 40000 ham messages in my own corpus (selected by hand).

Personally, I use the following custom list of stopwords in my SpamAssassin installation:

bayes_stopword_ru (?^:(на|по|не|от|для|или|за|Вас|из|что|если|будет|Вам|Если|мы|Здравствуйте|есть|это|можно|только|вас|нужно|без|его))

Each of these words had a Bayes score between 0.4 and 0.6 (for a Bayes database trained on my corpus with no stopwords) and was found in at least 10% of the messages in my corpus.
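
The selection criterion described above (a Bayes score between 0.4 and 0.6, and presence in at least 10% of messages) can be sketched roughly as follows. This is only an illustration, not SpamAssassin code: the function names `message_fraction` and `is_stopword_candidate` are hypothetical, the score bounds are assumed to be inclusive, and the per-token Bayes score is taken as a given input rather than computed. The only real numbers used are the "иногда" counts quoted in the report; the neutral score of 0.5 is a placeholder.

```python
def message_fraction(spam_hits, ham_hits, spam_total, ham_total):
    """Fraction of all corpus messages (spam + ham) that contain the token."""
    return (spam_hits + ham_hits) / (spam_total + ham_total)


def is_stopword_candidate(bayes_score, fraction,
                          score_range=(0.4, 0.6), min_fraction=0.10):
    """Apply the two criteria from the report.

    Assumption: both bounds of the score range are inclusive.
    """
    low, high = score_range
    return low <= bayes_score <= high and fraction >= min_fraction


# "иногда": found in 7 of 8000 spam and 50 of 40000 ham messages.
inogda_fraction = message_fraction(7, 50, 8000, 40000)

# Placeholder score of 0.5 (hypothetical): even a perfectly neutral
# token is rejected when it appears in far fewer than 10% of messages.
inogda_is_candidate = is_stopword_candidate(0.5, inogda_fraction)

if __name__ == "__main__":
    print(f"fraction of messages: {inogda_fraction:.4%}")
    print(f"stopword candidate:   {inogda_is_candidate}")
```

With these counts "иногда" appears in 57 of 48000 messages, well under the 10% threshold, which matches the argument that it should not be treated as a stopword.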