[Bug 6229] [review] TextCat is too case sensitive

bugzilla-daemon Fri, 06 May 2011 17:06:54 -0700

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6229

Mark Martinec <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Status Whiteboard|                            |needs 1 vote

--- Comment #18 from Mark Martinec <[email protected]> 2011-05-07 00:06:12 UTC ---
> Committed revision 1100378.

+1  for 3.3.2

(This should do for 3.3.2; we can worry about possibly more complex
improvements later.)

> > > $word = Encode::decode_utf8($word); # set the flag
> > I think that's trying to be too clever.. I believe the textcat database has
> > some utf-8 signatures also.
Darxus writes:
> I don't, far from it.  That should give you proper case conversion for the
> entire set of utf8 characters.  
> It would be better to figure out how to set the locale to utf8 for all of SA
> early on, but I think setting the flag on this variable here is cleaner than
> trying to figure out the right set of characters to feed to tr().  Although I
> completely agree with Mark that your tr() solution is fine for 3.3.2.

You are assuming that text from mail parts is decoded from octets into
Perl characters (utf8) according to the MIME header fields of each mail part.
That is how it should ideally work, but it is not currently the case. A proper
solution will require work at several levels of the SpamAssassin modules.
There were some attempts in the past, but we were often burned by Perl
bugs in utf8 handling in older versions of Perl, or by horrible slowdowns.
Eventually we will need to go this way, but it probably won't happen
even with the 3.4 release. For 3.3.x we can only assume that plain
octets reach TextCat, with no associated character set and no decoding
attempted. The Latin-1 case conversion assumed by this patch is just an
attempt to deal with 80% of cases where TextCat got it wrong so far.
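
To make the trade-off concrete, here is a minimal sketch of lowercasing raw octets as if they were Latin-1. It is an illustration only, not the committed patch: the helper name lc_latin1_octets and the exact byte ranges passed to tr() are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only (not the committed patch).
# Lowercases a string of raw octets, treating bytes 0x80-0xFF as
# ISO-8859-1 (Latin-1) characters. No decoding is attempted.
sub lc_latin1_octets {
    my ($word) = @_;
    # ASCII A-Z, plus Latin-1 uppercase letters 0xC0-0xDE. 0xD7 is the
    # multiplication sign, not a letter, so it is left out of the ranges.
    $word =~ tr/A-Z\xC0-\xD6\xD8-\xDE/a-z\xE0-\xF6\xF8-\xFE/;
    return $word;
}

# Latin-1 input: "ECOLE" with E-acute (0xC9) becomes lowercase (0xE9).
print unpack("H*", lc_latin1_octets("\xC9COLE")), "\n";      # e9636f6c65

# UTF-8 input: E-acute is the two octets 0xC3 0x89. The lead byte 0xC3
# falls inside the Latin-1 uppercase range and is rewritten to 0xE3,
# so the result is no longer the UTF-8 encoding of e-acute (0xC3 0xA9).
print unpack("H*", lc_latin1_octets("\xC3\x89COLE")), "\n";  # e389636f6c65
```

The second call shows the kind of input the octet-level approach cannot handle: without knowing the character set, a UTF-8 lead byte is indistinguishable from a Latin-1 uppercase letter.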

