[Bug 7130] New: Bayes tokenization mangles/chops many UTF-8 words with accented, Cyrillic etc. letters - inappropriately assuming ISO-8859 encoding

bugzilla-daemon Tue, 03 Feb 2015 11:34:28 -0800

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7130


            Bug ID: 7130
           Summary: Bayes tokenization mangles/chops many UTF-8 words with
                    accented, Cyrillic etc. letters - inappropriately
                    assuming ISO-8859 encoding
           Product: Spamassassin
           Version: 3.4.0
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Plugins
          Assignee: [email protected]
          Reporter: [email protected]

Looking at the outcome of Bayes tokenization (grepping the debug output
from '-D bayes' for 'bayes: token'), it is not uncommon to see fragments
of words as tokens. Some of these are followed by the first byte of a
chopped (partial) UTF-8 sequence, and some are hardly recognizable as a
fragment of any word.

Consider the following text (UTF-8):

  außerdem            (= au<C3><9F>erdem )
  Šumečega            (= <C5><A0>ume<C4><8D>ega )
  Đokovićem           (= <C4><90>okovi<C4><87>em )
  Jiří nejčastější české jméno  (= Ji<C5><99><C3><AD>
                         nej<C4><8D>ast<C4><9B>j<C5><A1><C3><AD>
                         <C4><8D>esk<C3><A9> jm<C3><A9>no )
  Заглавная страница  (= <D0><97><D0><B0><D0><B3><D0><BB><D0><B0>
                         <D0><B2><D0><BD><D0><B0><D1><8F>
                         <D1><81><D1><82><D1><80><D0><B0>
                         <D0><BD><D0><B8><D1><86><D0><B0> )

and look at the resulting tokens:

  au, au<C3>, erdem
  ume, ume<C4>, ega
  okovi, okovi<C4>
  Ji, ji<C5>, ji, Ji<C5>
  nej, nej<C4>, ast, ast<C4>, j, jší
  esk, eské
  jmno, jméno
  аглавна<D1>, ани<D1>


The most problematic issue here is that some of the two-byte UTF-8
sequences (RFC 3629: UTF8-2) representing a letter are chopped,
the first byte C2..DF retained and the second byte (UTF8-tail)
treated as a delimiter and discarded.

This does not happen with all letters, just those whose UTF8-tail
byte(s) happen to fall into the 80..A0 range (out of the full
allowed 80..BF range).

Victim letters are many (but not all) of the Latin accented
letters, Cyrillic letters, as well as some more esoteric letters.

Here are some of the affected letters:

  c3: À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à 
  c4: Ā ā Ă ă Ą ą Ć ć Ĉ ĉ Ċ ċ Č č Ď ď Đ đ Ē ē Ĕ ĕ Ė ė Ę ę Ě ě Ĝ ĝ Ğ ğ Ġ 
  c5: ŀ Ł ł Ń ń Ņ ņ Ň ň ŉ Ŋ ŋ Ō ō Ŏ ŏ Ő ő Œ œ Ŕ ŕ Ŗ ŗ Ř ř Ś ś Ŝ ŝ Ş ş Š
  c6: ƀ Ɓ Ƃ ƃ Ƅ ƅ Ɔ Ƈ ƈ Ɖ Ɗ Ƌ ƌ ƍ Ǝ Ə Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ ƙ ƚ ƛ Ɯ Ɲ ƞ Ɵ Ơ
  c7: ǀ ǁ ǂ ǃ Ǆ ǅ ǆ Ǉ ǈ ǉ Ǌ ǋ ǌ Ǎ ǎ Ǐ ǐ Ǒ ǒ Ǔ ǔ Ǖ ǖ Ǘ ǘ Ǚ ǚ Ǜ ǜ ǝ Ǟ ǟ Ǡ
  c8: Ȁ ȁ Ȃ ȃ Ȅ ȅ Ȇ ȇ Ȉ ȉ Ȋ ȋ Ȍ ȍ Ȏ ȏ Ȑ ȑ Ȓ ȓ Ȕ ȕ Ȗ ȗ Ș ș Ț ț Ȝ ȝ Ȟ ȟ Ƞ
  c9: ɀ Ɂ ɂ Ƀ Ʉ Ʌ Ɇ ɇ Ɉ ɉ Ɋ ɋ Ɍ ɍ Ɏ ɏ ɐ ɑ ɒ ɓ ɔ ɕ ɖ ɗ ɘ ə ɚ ɛ ɜ ɝ ɞ ɟ ɠ
  ca: ʀ ʁ ʂ ʃ ʄ ʅ ʆ ʇ ʈ ʉ ʊ ʋ ʌ ʍ ʎ ʏ ʐ ʑ ʒ ʓ ʔ ʕ ʖ ʗ ʘ ʙ ʚ ʛ ʜ ʝ ʞ ʟ ʠ
[...]
  ce: ΀ ΁ ΂ ΃ ΄ ΅ Ά · Έ Ή Ί ΋ Ό ΍ Ύ Ώ ΐ Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π
  cf: π ρ ς σ τ υ φ χ ψ ω ϊ ϋ ό ύ ώ Ϗ ϐ ϑ ϒ ϓ ϔ ϕ ϖ ϗ Ϙ ϙ Ϛ ϛ Ϝ ϝ Ϟ ϟ Ϡ
  d0: Ѐ Ё Ђ Ѓ Є Ѕ І Ї Ј Љ Њ Ћ Ќ Ѝ Ў Џ А Б В Г Д Е Ж З И Й К Л М Н О П Р
  d1: р с т у ф х ц ч ш щ ъ ы ь э ю я ѐ ё ђ ѓ є ѕ і ї ј љ њ ћ ќ ѝ ў џ Ѡ
[...]  (there are more...)
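
Each row above is one lead byte combined with the 33 tail bytes 80..A0.
A minimal sketch that regenerates such a row as raw UTF-8 octets follows;
the helper name affected_row is hypothetical, and it only covers the
two-byte sequences (lead bytes C2..DF) discussed here.

```perl
use strict;
use warnings;

# Hypothetical helper: for a given UTF8-2 lead byte (C2..DF), return the
# two-byte sequences whose tail byte lies in 80..A0, i.e. the ones whose
# second byte the tokenizer treats as a delimiter.
sub affected_row {
    my ($lead) = @_;
    return map { chr($lead) . chr($_) } 0x80 .. 0xA0;
}

# Print the row for lead byte C3 as raw octets (readable on a UTF-8 terminal).
my @row = affected_row(0xC3);
printf "%02x: %s\n", 0xC3, join(' ', @row);
```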


The main culprit is the following code section in the subroutine
Mail::SpamAssassin::Plugin::Bayes::_tokenize_line():

  # include quotes, .'s and -'s for URIs, and [$,]'s for Nigerian-scam strings,
  # and ISO-8859-15 alphas.  Do not split on @'s; better results keeping it.
  # Some useful tokens: "$31,000,000" "www.clock-speed.net" "f*ck" "Hits!"
  tr/-A-Za-z0-9,\@\*\!_'"\$.\241-\377 / /cs;

which assumes the text is in ISO-8859-15, although (according to Bug 7126,
January 2015) 68% of non-ASCII text is encoded as UTF-8 even with no
'normalization', and 89% of non-ASCII text is UTF-8 when normalize_charset
is enabled.
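
The chopping can be reproduced outside SpamAssassin by applying the same
tr/// to a UTF-8 encoded word. This is only a minimal sketch: the helper
name split_like_old_tokenizer is hypothetical, and the real
_tokenize_line() does more processing afterwards, so the fragments it
prints are not the final Bayes tokens.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SpamAssassin): apply the tr/// quoted
# above to a string of octets and return the space-separated fragments.
sub split_like_old_tokenizer {
    my ($octets) = @_;
    (my $s = $octets) =~ tr/-A-Za-z0-9,\@\*\!_'"\$.\241-\377 / /cs;
    return grep { length } split / /, $s;
}

# "außerdem" as UTF-8 octets: the sharp s is C3 9F, and 9F is below A1,
# so the tr/// turns it into a space while keeping the lead byte C3.
my @fragments = split_like_old_tokenizer("au\xC3\x9Ferdem");

# Show each fragment with non-ASCII bytes written as <XX>.
for my $f (@fragments) {
    (my $shown = $f) =~ s/([\x80-\xFF])/sprintf('<%02X>', ord $1)/ge;
    print "$shown\n";    # prints "au<C3>" and then "erdem"
}
```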

At our site I have replaced the above tr/// with the following statement:

  s{ ( [A-Za-z0-9,@*!_'"\$. -]+  |
       [\xC0-\xDF][\x80-\xBF]    |
       [\xE0-\xEF][\x80-\xBF]{2} |
       [\xF0-\xF4][\x80-\xBF]{3} |
       [\xA1-\xFF] ) | . }
   { defined $1 ? $1 : '' }xsge;

which preserves UTF-8 byte sequences as indivisible entities.
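
A small sketch for checking this on the sample words; the wrapper name
keep_utf8_sequences is hypothetical, while the substitution inside it is
the statement quoted above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the replacement statement, working on a copy.
sub keep_utf8_sequences {
    my ($octets) = @_;
    (my $s = $octets) =~
      s{ ( [A-Za-z0-9,@*!_'"\$. -]+  |
           [\xC0-\xDF][\x80-\xBF]    |
           [\xE0-\xEF][\x80-\xBF]{2} |
           [\xF0-\xF4][\x80-\xBF]{3} |
           [\xA1-\xFF] ) | . }
       { defined $1 ? $1 : '' }xsge;
    return $s;
}

# "außerdem" and "Šumečega" as UTF-8 octets pass through unchanged.
for my $word ("au\xC3\x9Ferdem", "\xC5\xA0ume\xC4\x8Dega") {
    my $out = keep_utf8_sequences($word);
    print(($out eq $word ? "intact" : "changed"), "\n");
}
```

Note that, as written, a byte that matches none of the alternatives is
deleted rather than replaced by a space as the tr/// does, so a stray
9F between two letters joins them into one word.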

The only problem with the above is that it is 20 times slower
than the tr///.  For example, it takes 5 ms to process 10 kB of text.
There are other expensive steps later in _tokenize_line(), so the
overall impact is not as bad as it sounds, but still...
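
One way to repeat such a measurement is the core Benchmark module. The
following is a sketch under assumptions: the test text is synthetic
(two of the sample words repeated to about 10 kB), the sub names are
hypothetical, and the ratio it reports will depend on the Perl version
and the machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# About 10 kB of UTF-8 octets built from two of the sample words.
my $text = "au\xC3\x9Ferdem \xC5\xA0ume\xC4\x8Dega " x 500;

# The original statement, applied to a copy of the text.
sub via_tr {
    (my $s = $text) =~ tr/-A-Za-z0-9,\@\*\!_'"\$.\241-\377 / /cs;
    return $s;
}

# The replacement statement, applied to a copy of the text.
sub via_regex {
    (my $s = $text) =~
      s{ ( [A-Za-z0-9,@*!_'"\$. -]+  |
           [\xC0-\xDF][\x80-\xBF]    |
           [\xE0-\xEF][\x80-\xBF]{2} |
           [\xF0-\xF4][\x80-\xBF]{3} |
           [\xA1-\xFF] ) | . }
       { defined $1 ? $1 : '' }xsge;
    return $s;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, { 'tr///' => \&via_tr, 's{}{}e' => \&via_regex });
```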

Perhaps the cleanest solution would be to use 'Unicode character
properties' such as \p{Alnum}, available in newer versions of Perl (see
the 'perluniprops' perl man page), and to keep the text as characters
(instead of encoded as UTF-8 octets as we have it now, regardless of
the 'normalize_charset' option).
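
To illustrate the idea only, here is a sketch of character-based
splitting. It is not SpamAssassin code: the function name tokenize_chars
is hypothetical, and decoding each line with Encode is an assumption
about where the octets-to-characters conversion would happen.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sketch: decode the octets to characters, treat anything that
# is neither alphanumeric (in the Unicode sense) nor one of the punctuation
# characters kept by the original tr/// as a delimiter, then split.
sub tokenize_chars {
    my ($octets) = @_;
    my $text = decode('UTF-8', $octets);    # UTF-8 octets -> characters
    $text =~ s/[^\p{Alnum},\@*!_'"\$. -]+/ /g;
    # Encode each token back to octets for storage or printing.
    return map { encode('UTF-8', $_) } grep { length } split / /, $text;
}

# "außerdem české (jméno)" as UTF-8 octets.
my @tokens = tokenize_chars("au\xC3\x9Ferdem \xC4\x8Desk\xC3\xA9 (jm\xC3\xA9no)");
print "$_\n" for @tokens;
```

Because the class works on characters, no multi-byte sequence can be
split in the middle; the cost is a decode step per line.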

I don't have a good solution or proposal at this time. I'll be using
the modified _tokenize_line() here to solve our immediate problem
with Bayes. Just documenting my findings before they are forgotten :)

