[Bug 7130] Bayes tokenization mangles/chops many UTF-8 words with accented, Cyrillic etc. letters - inappropriately assuming ISO-8859 encoding

bugzilla-daemon Fri, 13 Feb 2015 17:30:07 -0800

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7130


--- Comment #1 from Mark Martinec <[email protected]> ---
> The main culprit is the following code section
> in subroutine MS::Plugin::Bayes::_tokenize_line() :
>   tr/-A-Za-z0-9,\@\*\!_'"\$.\241-\377 / /cs;
> which assumes the text is in ISO-8859-15, although (according to Bug 7126,
> January 2015) 68% of non-ASCII text is encoded as UTF-8 even with no
> 'normalization', and 89% of non-ASCII text is UTF-8 when normalize_charset
> is enabled.
> 
> At our site I have replaced the above tr/// with the following statement:
>   s{ ( [A-Za-z0-9,@*!_'"\$. -]+  |
>        [\xC0-\xDF][\x80-\xBF]    |
>        [\xE0-\xEF][\x80-\xBF]{2} |
>        [\xF0-\xF4][\x80-\xBF]{3} |
>        [\xA1-\xFF] ) | . }
>    { defined $1 ? $1 : '' }xsge;
> which preserves UTF-8 byte sequences as indivisible entities.
> 
> The only problem with the above is that it is 20 times slower
> than the tr///.  For example it takes 5 ms to process 10 kB of text.
> There are further expensive steps in _tokenize_line() further down
> so the overall impact is not as bad as it sounds, but still...
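
To illustrate the difference between the two statements quoted above, here
is a minimal sketch that applies both to the raw UTF-8 bytes of "voilà".
The sub names filter_tr and filter_utf8 and the sample word are only
assumptions for this illustration and are not part of Bayes.pm; the two
statements themselves are copied from the quoted text.

```perl
use strict;
use warnings;

# Original statement: keeps the listed ASCII characters and bytes 0xA1..0xFF,
# turns everything else (including bytes 0x80..0xA0) into a space.
sub filter_tr {
    my ($s) = @_;
    $s =~ tr/-A-Za-z0-9,\@\*\!_'"\$.\241-\377 / /cs;
    return $s;
}

# Replacement statement: keeps well-formed 2-, 3- and 4-byte UTF-8
# sequences together, so a continuation byte is never split off.
sub filter_utf8 {
    my ($s) = @_;
    $s =~ s{ ( [A-Za-z0-9,@*!_'"\$. -]+  |
               [\xC0-\xDF][\x80-\xBF]    |
               [\xE0-\xEF][\x80-\xBF]{2} |
               [\xF0-\xF4][\x80-\xBF]{3} |
               [\xA1-\xFF] ) | . }
           { defined $1 ? $1 : '' }xsge;
    return $s;
}

# "voilà" as UTF-8 octets: the final letter is the byte pair C3 A0.
my $in = "voil\xC3\xA0";

printf "%-6s %s\n", 'input', unpack('H*', $in);               # 766f696cc3a0
printf "%-6s %s\n", 'tr///', unpack('H*', filter_tr($in));    # 766f696cc320
printf "%-6s %s\n", 's///',  unpack('H*', filter_utf8($in));  # 766f696cc3a0
```

In the tr/// result the continuation byte 0xA0 falls outside \241-\377 and
becomes a space, leaving a dangling lead byte 0xC3, while the s/// keeps
the two-byte sequence intact.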


I did some profiling of _tokenize_line() using Devel::NYTProf.

The s/// above is indeed much slower than tr///, but it accounts for
only about 10 to 15 % of the overall time spent in _tokenize_line()
(e.g. 64 ms out of 460 ms for a 200 kB message loaded with plenty of
UTF-8 characters; the times include some profiling overhead).
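
For a rough comparison of the two statements in isolation, without a full
Devel::NYTProf run, something like the following sketch could be used. It
relies on the core Benchmark module instead of the profiler, and the input
is a synthetic assumption (about 10 kB of a repeated Cyrillic word), not
the message measured above, so the numbers it prints will differ. The sub
names filter_tr and filter_utf8 are hypothetical.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic input (an assumption): one UTF-8 encoded Cyrillic word
# followed by a space, 13 bytes, repeated to give 10400 bytes.
my $word = "\xD0\xBF\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82 ";
my $text = $word x 800;

# The original tr/// statement, applied to a copy of the input.
sub filter_tr {
    my ($s) = @_;
    $s =~ tr/-A-Za-z0-9,\@\*\!_'"\$.\241-\377 / /cs;
    return $s;
}

# The replacement s/// statement, applied to a copy of the input.
sub filter_utf8 {
    my ($s) = @_;
    $s =~ s{ ( [A-Za-z0-9,@*!_'"\$. -]+  |
               [\xC0-\xDF][\x80-\xBF]    |
               [\xE0-\xEF][\x80-\xBF]{2} |
               [\xF0-\xF4][\x80-\xBF]{3} |
               [\xA1-\xFF] ) | . }
           { defined $1 ? $1 : '' }xsge;
    return $s;
}

# Run each variant for at least one CPU second and print a rate table.
cmpthese(-1, {
    'tr///' => sub { filter_tr($text) },
    's///'  => sub { filter_utf8($text) },
});
```

This only measures the two statements on their own; the share of time
within _tokenize_line() as a whole still needs a profiler run.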

To put it into perspective, the following clauses further down in
the same subroutine *each* take roughly the same amount of time
as the s/// above:

  $token =~ s/^[-'"\.,]+//;

  $token =~ s/[-'"\.,]+$//;

  $token =~ /^(?:a(?:ble|l(?:ready|l)|n[dy]|re)|b(?:ecause|oth)|c(?:an|ome)|e(?:ach|mail|ven)|f(?:ew|irst|or|rom)|give|h(?:a(?:ve|s)|ttp)|i(?:n(?:formation|to)|t\'s)|just|know|l(?:ike|o(?:ng|ok))|m(?:a(?:de|il(?:(?:ing|to))?|ke|ny)|o(?:re|st)|uch)|n(?:eed|o[tw]|umber)|o(?:ff|n(?:ly|e)|ut|wn)|p(?:eople|lace)|right|s(?:ame|ee|uch)|t(?:h(?:at|is|rough|e)|ime)|using|w(?:eb|h(?:ere|y)|ith(?:out)?|or(?:ld|k))|y(?:ears?|ou(?:(?:\'re|r))?))$/i

  if (CHEW_BODY_MAILADDRS && $token =~ /\S\@\S/i)



So it's not too bad, and I think the benefits are well worth it,
especially if we enable normalize_charset by default, which results
in almost 90% of all non-ASCII text being encoded as UTF-8.

