https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7130
--- Comment #1 from Mark Martinec <[email protected]> ---
> The main culprit is the following code section
> in subroutine MS::Plugin::Bayes::_tokenize_line() :
>   tr/-A-Za-z0-9,\@\*\!_'"\$.\241-\377 / /cs;
> which assumes the text is in ISO-8859-15, although (according to Bug 7126,
> January 2015) 68% of non-ASCII text is encoded as UTF-8 even with no
> 'normalization', and 89% of non-ASCII text is UTF-8 when normalize_charset
> is enabled.
>
> At our site I have replaced the above tr/// with the following statement:
>   s{ ( [A-Za-z0-9,@*!_'"\$. -]+ |
>        [\xC0-\xDF][\x80-\xBF] |
>        [\xE0-\xEF][\x80-\xBF]{2} |
>        [\xF0-\xF4][\x80-\xBF]{3} |
>        [\xA1-\xFF] ) | . }
>    { defined $1 ? $1 : '' }xsge;
> which preserves UTF-8 byte sequences as indivisible entities.
>
> The only problem with the above is that it is 20 times slower
> than the tr///. For example it takes 5 ms to process 10 kB of text.
> There are further expensive steps in _tokenize_line() further down
> so the overall impact is not as bad as it sounds, but still...

I did some profiling of _tokenize_line() using Devel::NYTProf.
The s/// above is indeed much slower than the tr///, but it accounts
for only about 10 to 15% of the total time spent in _tokenize_line()
(e.g. 64 ms of the 460 ms spent in _tokenize_line() for a 200 kB
message loaded with plenty of UTF-8 characters; the times include
some profiling overhead).
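The behavioural difference between the two quoted statements can be seen on a short octet string. The following is only a sketch: the helper names (clean_with_tr, clean_with_subst, hex_octets) and the sample string are made up for illustration, and the s/// is copied as quoted above, including its '' replacement. The sample contains the UTF-8 encoding of U+20AC (E2 82 AC), whose middle octet falls outside \241-\377.

```perl
use strict;
use warnings;

# Hypothetical helpers; both work on raw octets, not decoded characters.

sub clean_with_tr {
    my ($s) = @_;
    # Original statement: every octet outside the listed set becomes a space.
    $s =~ tr/-A-Za-z0-9,\@\*\!_'"\$.\241-\377 / /cs;
    return $s;
}

sub clean_with_subst {
    my ($s) = @_;
    # Replacement statement as quoted: octet runs shaped like UTF-8
    # sequences are captured whole and kept; anything not captured is dropped.
    $s =~ s{ ( [A-Za-z0-9,@*!_'"\$. -]+ |
               [\xC0-\xDF][\x80-\xBF] |
               [\xE0-\xEF][\x80-\xBF]{2} |
               [\xF0-\xF4][\x80-\xBF]{3} |
               [\xA1-\xFF] ) | . }
           { defined $1 ? $1 : '' }xsge;
    return $s;
}

sub hex_octets {
    my ($s) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $s;
}

# "cost", a space, the three octets of U+20AC, a space, "5"
my $euro = "cost \xE2\x82\xAC 5";

print "input: ", hex_octets($euro), "\n";
print "tr///: ", hex_octets(clean_with_tr($euro)), "\n";
# 63 6F 73 74 20 E2 20 AC 20 35   (0x82 replaced by a space, sequence split)
print "s///:  ", hex_octets(clean_with_subst($euro)), "\n";
# 63 6F 73 74 20 E2 82 AC 20 35   (sequence kept intact)
```

With the tr///, the lead octet E2 and the trailing octet AC end up in separate whitespace-delimited pieces, so a later split on whitespace would no longer see the character as one unit.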
To put it into perspective, the following clauses further down in the
same subroutine *each* take roughly the same amount of time as the
s/// above:

  $token =~ s/^[-'"\.,]+//;
  $token =~ s/[-'"\.,]+$//;

  $token =~ /^(?:a(?:ble|l(?:ready|l)|n[dy]|re)|b(?:ecause|oth)|c(?:an|ome)|e(?:ach|mail|ven)|f(?:ew|irst|or|rom)|give|h(?:a(?:ve|s)|ttp)|i(?:n(?:formation|to)|t\'s)|just|know|l(?:ike|o(?:ng|ok))|m(?:a(?:de|il(?:(?:ing|to))?|ke|ny)|o(?:re|st)|uch)|n(?:eed|o[tw]|umber)|o(?:ff|n(?:ly|e)|ut|wn)|p(?:eople|lace)|right|s(?:ame|ee|uch)|t(?:h(?:at|is|rough|e)|ime)|using|w(?:eb|h(?:ere|y)|ith(?:out)?|or(?:ld|k))|y(?:ears?|ou(?:(?:\'re|r))?))$/i

  if (CHEW_BODY_MAILADDRS && $token =~ /\S\@\S/i)

So it's not too bad, and I think the benefits are well worth it,
especially if we go for enabling the normalize_charset by default,
which makes almost 90% of all non-ASCII text be encoded in UTF-8.
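To get a rough feel for the relative cost of the tr/// and the s/// on other data, the core Benchmark module can be used. This is a sketch under stated assumptions, not the Devel::NYTProf measurement described above: the sample text is synthetic (about 10 kB of mixed ASCII and UTF-8 octets), the sub names are made up, and the ratio it reports will depend on the Perl version and on the input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic sample (an assumption): a 21-octet unit repeated 500 times,
# i.e. 10500 octets of ASCII mixed with 2- and 3-octet UTF-8 sequences.
my $sample = "na\xC3\xAFve caf\xC3\xA9 \xE2\x82\xAC 42, " x 500;

sub with_tr {
    my $s = $sample;    # work on a copy so every iteration sees the same input
    $s =~ tr/-A-Za-z0-9,\@\*\!_'"\$.\241-\377 / /cs;
    return $s;
}

sub with_subst {
    my $s = $sample;
    $s =~ s{ ( [A-Za-z0-9,@*!_'"\$. -]+ |
               [\xC0-\xDF][\x80-\xBF] |
               [\xE0-\xEF][\x80-\xBF]{2} |
               [\xF0-\xF4][\x80-\xBF]{3} |
               [\xA1-\xFF] ) | . }
           { defined $1 ? $1 : '' }xsge;
    return $s;
}

# A negative count asks Benchmark to run each sub for at least that many
# CPU seconds, then prints a table of rates and relative differences.
cmpthese(-1, {
    'tr///' => \&with_tr,
    's///'  => \&with_subst,
});
```

This only isolates the two statements; it says nothing about their share of the total time in _tokenize_line(), which is what the profiling figures above address.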
