https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8271

azo...@geolink-group.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |azo...@geolink-group.com

--- Comment #2 from azo...@geolink-group.com ---
Yes, I understand that it is difficult to generate stopwords for a language you don't know.

Some short Russian stopwords (such as "не") may have been skipped by your Python script because Unicode-aware Python counts them as 2 characters, although they actually occupy 4 bytes in UTF-8. The check "next if $len < 3;" in sub _tokenize_line in Bayes.pm does not skip such tokens because their length is 4 bytes. Such words (2 characters encoded with 4 bytes) should be included in stopword lists for any language, even though they look too short at first sight.

You can use my Russian stoplist for SpamAssassin rules. Its human-readable form is:

bayes_stopword_ru (?^:(на|по|не|от|для|или|за|Вас|из|что|если|будет|Вам|Если|мы|Здравствуйте|есть|это|можно|только|вас|нужно|без|его))

Optimized regexp:

bayes_stopword_ru (?^:(?:\xd0(?:\x92\xd0\xb0(?:\xd0\xbc|\xd1\x81)|\x95\xd1\x81\xd0\xbb\xd0\xb8|\x97\xd0\xb4\xd1\x80\xd0\xb0\xd0\xb2\xd1\x81\xd1\x82\xd0\xb2\xd1\x83\xd0\xb9\xd1\x82\xd0\xb5|\xb1(?:\xd0\xb5\xd0\xb7|\xd1\x83\xd0\xb4\xd0\xb5\xd1\x82)|\xb2\xd0\xb0\xd1\x81|\xb4\xd0\xbb\xd1\x8f|\xb5(?:\xd0\xb3\xd0\xbe|\xd1\x81(?:\xd0\xbb\xd0\xb8|\xd1\x82\xd1\x8c))|\xb7\xd0\xb0|\xb8\xd0(?:\xb7|\xbb\xd0\xb8)|\xbc(?:\xd0\xbe\xd0\xb6\xd0\xbd\xd0\xbe|\xd1\x8b)|\xbd(?:\xd0(?:\xb0|\xb5)|\xd1\x83\xd0\xb6\xd0\xbd\xd0\xbe)|\xbe\xd1\x82|\xbf\xd0\xbe)|\xd1(?:\x82\xd0\xbe\xd0\xbb\xd1\x8c\xd0\xba\xd0\xbe|\x87\xd1\x82\xd0\xbe|\x8d\xd1\x82\xd0\xbe)))

I hope it's versatile enough. I tried to exclude all words that are specific to our mail server's message flow. I hope this list is better than "standard" stoplists, which are built from a corpus of fiction books rather than from real messages.
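
To illustrate the character-versus-byte mismatch described above, here is a minimal Python sketch. The helper names (char_len, byte_len, skipped_by_length, byte_escaped) are hypothetical, and the threshold of 3 simply mirrors the quoted "next if $len < 3;" check. This is not the actual Bayes.pm code or the script that generated the stopwords, only a way to see why "не" passes a byte-based length check while failing a character-based one.

```python
# Hypothetical helpers for illustration only; not SpamAssassin code.
# MIN_TOKEN_LEN mirrors the "next if $len < 3;" check quoted in the comment.
MIN_TOKEN_LEN = 3


def char_len(token: str) -> int:
    """Length in Unicode code points, as a Unicode-aware script sees it."""
    return len(token)


def byte_len(token: str) -> int:
    """Length in bytes after UTF-8 encoding, as a byte-oriented check sees it."""
    return len(token.encode("utf-8"))


def skipped_by_length(token: str, count_bytes: bool) -> bool:
    """Return True if a '< MIN_TOKEN_LEN' check would discard the token."""
    length = byte_len(token) if count_bytes else char_len(token)
    return length < MIN_TOKEN_LEN


def byte_escaped(token: str) -> str:
    """Render the UTF-8 bytes of a token in \\xNN form, as in the optimized regexp."""
    return "".join(f"\\x{b:02x}" for b in token.encode("utf-8"))


if __name__ == "__main__":
    for word in ["не", "to", "для"]:
        print(
            word,
            char_len(word),
            byte_len(word),
            skipped_by_length(word, count_bytes=False),
            skipped_by_length(word, count_bytes=True),
            byte_escaped(word),
        )
    # First line printed: не 2 4 True False \xd0\xbd\xd0\xb5
```

A character-based check drops "не" as too short, while a byte-based check keeps it as a token, which is why such two-letter Cyrillic words need to be in the stoplist.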

The list consists of 24 words, most of which are also present in "standard" stoplists (for example, http://snowball.tartarus.org/algorithms/russian/stop.txt). The only exceptions (my additions) are "Здравствуйте" (polite form of "Hello") and "нужно" ("should" or "need" depending on context).

-- 
You are receiving this mail because:
You are the assignee for the bug.