https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8271

azo...@geolink-group.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |azo...@geolink-group.com

--- Comment #2 from azo...@geolink-group.com ---
Yes, I understand that it is difficult to generate stopwords for a language you
don't know.

Some short Russian stopwords (such as "не") may have been skipped by your Python
script because Unicode-aware Python counts them as 2 characters, although they
occupy 4 bytes in UTF-8 encoding. The check "next if $len < 3;" in sub
_tokenize_line in Bayes.pm does not skip such tokens because their length there
is 4 bytes. Such words (2 characters encoded as 4 bytes) should therefore be
included in the stopword lists for any language, even though they look too
short at first sight.
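
The mismatch described above can be reproduced with a minimal Python sketch.
It is only an illustration: the helper names are hypothetical, and
passes_byte_length_check merely imitates the "next if $len < 3;" test under the
assumption, taken from this comment, that Bayes.pm measures the length in bytes.

```python
# Illustration only: character count versus UTF-8 byte count for short words.

def lengths(word):
    """Return (number of characters, number of UTF-8 bytes) for a word."""
    return len(word), len(word.encode("utf-8"))

def passes_byte_length_check(word, min_len=3):
    """Imitate 'next if $len < 3;' assuming the length is counted in bytes."""
    return len(word.encode("utf-8")) >= min_len

if __name__ == "__main__":
    for w in ["не", "мы", "of", "the"]:
        chars, nbytes = lengths(w)
        # A 2-character Cyrillic word has 4 bytes and survives the check,
        # while a 2-character ASCII word has 2 bytes and is skipped.
        print(w, chars, nbytes, passes_byte_length_check(w))
```

A stopword generator that filters on character count would drop "не", while a
tokenizer that filters on byte count would still keep it as a token.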

You can use my Russian stoplist for SpamAssassin rules.
Its human-readable form is:
bayes_stopword_ru
(?^:(на|по|не|от|для|или|за|Вас|из|что|если|будет|Вам|Если|мы|Здравствуйте|есть|это|можно|только|вас|нужно|без|его))
Optimized regexp:
bayes_stopword_ru
(?^:(?:\xd0(?:\x92\xd0\xb0(?:\xd0\xbc|\xd1\x81)|\x95\xd1\x81\xd0\xbb\xd0\xb8|\x97\xd0\xb4\xd1\x80\xd0\xb0\xd0\xb2\xd1\x81\xd1\x82\xd0\xb2\xd1\x83\xd0\xb9\xd1\x82\xd0\xb5|\xb1(?:\xd0\xb5\xd0\xb7|\xd1\x83\xd0\xb4\xd0\xb5\xd1\x82)|\xb2\xd0\xb0\xd1\x81|\xb4\xd0\xbb\xd1\x8f|\xb5(?:\xd0\xb3\xd0\xbe|\xd1\x81(?:\xd0\xbb\xd0\xb8|\xd1\x82\xd1\x8c))|\xb7\xd0\xb0|\xb8\xd0(?:\xb7|\xbb\xd0\xb8)|\xbc(?:\xd0\xbe\xd0\xb6\xd0\xbd\xd0\xbe|\xd1\x8b)|\xbd(?:\xd0(?:\xb0|\xb5)|\xd1\x83\xd0\xb6\xd0\xbd\xd0\xbe)|\xbe\xd1\x82|\xbf\xd0\xbe)|\xd1(?:\x82\xd0\xbe\xd0\xbb\xd1\x8c\xd0\xba\xd0\xbe|\x87\xd1\x82\xd0\xbe|\x8d\xd1\x82\xd0\xbe)))
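
For reference, here is a rough Python sketch of how a readable word list can be
turned into a byte-escaped alternation of the same kind. This is not the tool
used to produce the optimized regexp above: it does no prefix merging, and the
function names are hypothetical.

```python
import re

def to_byte_escaped(word):
    """Spell out each UTF-8 byte of a word as a \\xNN escape."""
    return "".join("\\x%02x" % b for b in word.encode("utf-8"))

def build_alternation(words):
    """Join the escaped words into one non-capturing alternation (unoptimized)."""
    return "(?:" + "|".join(to_byte_escaped(w) for w in words) + ")"

def matches(pattern, word):
    """Check whether the byte-level pattern matches the UTF-8 bytes of word."""
    return re.fullmatch(pattern.encode("ascii"), word.encode("utf-8")) is not None

if __name__ == "__main__":
    pattern = build_alternation(["на", "по", "не"])
    print(pattern)
    print(matches(pattern, "не"), matches(pattern, "от"))
```

The unoptimized output is longer than the hand-optimized form but matches the
same byte sequences, which makes it convenient for checking a list.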

I hope it's versatile enough. I tried to exclude all words that are specific
to our mail server's message flow. I hope this list is better than "standard"
stoplists built from a corpus of fiction books rather than from real messages.
The list consists of 24 words, most of which are also present in "standard"
stoplists (http://snowball.tartarus.org/algorithms/russian/stop.txt for
example). The only exceptions (my additions) are "Здравствуйте" (polite form of
"Hello") and "нужно" ("should" or "need" depending on context).
