[Bug 7130] Bayes tokenization mangles/chops many UTF-8 words with accented, Cyrillic etc. letters - inappropriately assuming ISO-8859 encoding

bugzilla-daemon Mon, 23 Feb 2015 09:54:31 -0800

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7130


Mark Martinec <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #3 from Mark Martinec <[email protected]> ---
In summary: improved tokenization of UTF-8 -encoded text (natively or
due to normalize-charset) at some processing expense, which is relatively
minor in the overall bayes tokenization CPU usage.  Closing.

(If some time in the future we decide to switch internal text representation
to Unicode (utf8 flag on), then these 'manual' dealing with UTF-8 encoding
bytes will go away)

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7130] Bayes tokenization mangles/chops many UTF-8 words with accented, Cyrillic etc. letters - inappropriately assuming ISO-8859 encoding

Reply via email to