https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6229
Summary: TextCat is too case sensitive
Product: Spamassassin
Version: SVN Trunk (Latest Devel Version)
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P5
Component: Plugins
AssignedTo: [email protected]
ReportedBy: [email protected]
Created an attachment (id=4562)
--> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4562)
TextCat problem sample
It seems the languages database is case sensitive. For example, all-uppercase
English spam gets wildly wrong language guesses.
I have no idea what the best way to fix this would be. For now I'm using a
quick fix like this to get better results:
--- TextCat.pm.orig 2009-10-29 09:23:46.985152046 +0200
+++ TextCat.pm 2009-10-29 09:24:38.339651987 +0200
@@ -440,6 +440,7 @@
   # my $non_word_characters = qr/[0-9\s]/;
   for my $word (split(/[0-9\s]+/, ${$_[0]}))
   {
+    $word =~ tr/A-ZÖÄÅ/a-zöäå/ if $word =~ /[a-zA-ZöäåÖÄÅ]{4}/;
     $word = "\000" . $word . "\000";
     my $len = length($word);
     my $flen = $len;
Attached is a sample message. Running it with textcat_max_languages 20 gives
us:
ja.iso-2022-jp de zh.big5 sk.windows-1250 id sk.us-ascii cs.iso-8859-2 ca da vi
sw ms tl ne pl
Running it with my fix gives the expected single "en".
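
For comparison, a more general variant of the same idea could use Perl's
built-in lc instead of a fixed tr table, so that letters other than
A-Z and ÖÄÅ are folded too. This is only a sketch: normalize_word is a
hypothetical helper, not something that exists in TextCat.pm, and it assumes
each word has already been decoded to a character string. That may not hold
for the byte strings TextCat actually sees, and the language models may need
matching treatment.

```perl
use strict;
use warnings;
use feature 'unicode_strings';    # Perl 5.12+: lc() handles U+0080..U+00FF

# Hypothetical helper (not part of TextCat.pm): lowercase a word only when it
# contains a run of at least four letters, mirroring the guard in the patch.
# Assumes $word is a decoded character string, not raw encoded bytes.
sub normalize_word {
    my ($word) = @_;
    $word = lc $word if $word =~ /\p{L}{4}/;
    return $word;
}

# Split on digits and whitespace, as the loop touched by the patch does.
# \x{C4} is "Ä"; escapes keep this file pure ASCII.
my $text  = "CHEAP WATCHES 2009 P\x{C4}IV\x{C4}\x{C4}";
my @words = map { normalize_word($_) } split /[0-9\s]+/, $text;

binmode(STDOUT, ':encoding(UTF-8)');
print join(' ', @words), "\n";
```

Words with fewer than four consecutive letters pass through unchanged, as
they do with the tr-based quick fix.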