https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6229
Mark Martinec <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Status Whiteboard|                            |needs 1 vote

--- Comment #18 from Mark Martinec <[email protected]> 2011-05-07 00:06:12 UTC ---
> Committed revision 1100378.

+1 for 3.3.2
(should do for 3.3.2, we'll worry about possible more complex
improvements later)

> > > $word = Encode::decode_utf8($word); # set the flag
> > I think that's trying to be too clever.. I believe the textcat database has
> > some utf-8 signatures also.

Darxus writes:
> I don't, far from it. That should give you proper case conversion for the
> entire set of utf8 characters.
> It would be better to figure out how to set the locale to utf8 for all of SA
> early on, but I think setting the flag on this variable here is cleaner than
> trying to figure out the right set of characters to feed to tr(). Although I
> completely agree with Mark that your tr() solution is fine for 3.3.2.

You are assuming that text from mail parts is decoded from octets
into perl characters (utf8) according to a MIME header field of
each mail part. This is how it should ideally be, but is not
currently the case. The proper solution will require work at
several levels of SpamAssassin modules. There were some attempts
in the past, but we were often burned by perl bugs dealing with
utf8 in older versions of perl, or just horrible slowdowns.
Eventually we'll need to go this way, but this probably won't
happen even with the 3.4 release.

For 3.3.x we can only assume there are plain octets reaching
TextCat, with no associated character set or any decoding attempted.
The assumed Latin1 case conversion by this patch is just an attempt
to deal with 80% of cases where TextCat got it wrong so far.
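
For illustration only, a Latin-1 case conversion on raw octets with tr() could look roughly like the sketch below. This is not the change committed in revision 1100378, which is not shown in this message: the helper name lc_latin1_octets is hypothetical, and the sketch assumes the input is a plain byte string with no character set attached and no utf8 flag set.

```perl
use strict;
use warnings;

# Hypothetical helper (not the committed patch): lowercase a string of
# raw octets on the assumption that they are ISO-8859-1 (Latin-1).
sub lc_latin1_octets {
    my ($word) = @_;
    # ASCII A-Z and the Latin-1 capitals 0xC0-0xD6 and 0xD8-0xDE each map
    # to the octet 0x20 higher. 0xD7 (multiplication sign) has no case,
    # so it is deliberately left out of the ranges.
    $word =~ tr/A-Z\xC0-\xD6\xD8-\xDE/a-z\xE0-\xF6\xF8-\xFE/;
    return $word;
}

# "\xC9COLE" is "ECOLE" with an acute-accented capital E in Latin-1.
print unpack("H*", lc_latin1_octets("\xC9COLE")), "\n";   # e9636f6c65
```

Octets outside the listed ranges pass through unchanged. If the octets are in fact UTF-8 or some other encoding, a fixed Latin-1 mapping like this does not case-fold them correctly, which fits the description above of the patch as a partial fix rather than the proper solution.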
