> From: Jason Bertoch [mailto:ja...@i6ix.com]
> Sent: Wednesday, May 26, 2010 3:34 PM
>
> On 2010/05/25 7:02 PM, Karsten Bräckelmann wrote:
> > On Wed, 2010-05-26 at 10:35 +1200, Jason Haar wrote:
> >
> > Not as far as ok_locales and the respective CHARSET_FARAWAY rules are
> > concerned, IIRC. They have been written long ago to trigger on the
> > char-sets used. They don't detect the char-set based on the actual
> > payload.
> >
>
> So where does that leave us? With the need for an update or addition to
> the FARAWAY rules? Also, what's the deal with normalize_charset? Can
> that have any impact on these cases where language/locale isn't
> detected?

Jason,

I may be completely wrong, but this is what I get grepping
'normalize_charset' in 3.3.1:

  Util/DependencyInfo.pm: desc => 'If you plan to use the normalize_charset config setting to detect
  Conf.pm:=item normalize_charset ( 0 | 1) (default: 0)
  Conf.pm: setting => 'normalize_charset',
  Conf.pm: $self->{parser}->lint_warn("config: normalize_charset requires Perl 5.8.5 or later");
  Conf.pm: $self->{parser}->lint_warn("config: normalize_charset requires HTML::Parser 3.46 or later");
  Conf.pm: $self->{parser}->lint_warn("config: normalize_charset requires Encode::Detect");
  Conf.pm: $self->{parser}->lint_warn("config: normalize_charset requires Encode");
  Conf.pm: $self->{normalize_charset} = 1;

As you can see, {normalize_charset} can be set. But... where is it used,
then? It may be used in a way I can't catch with grep, though...

Anyway, according to perldoc, normalize_charset would "allow detecting
the character set" used in a text content (which I believe is what you
are looking for) and eventually convert the text to Unicode.

Now, to me the encoding detection phase is probably less of an issue
here, because a wrong encoding specified in the content's header would
impair readability of the spam text by the recipient, which is
counter-productive to spammers. So the encoding actually used is
probably always specified in the header, and you can use it to score
mail with foreign encodings right now.

I don't believe this is going to make any difference anyway, since
nowadays most legit mail *and* spam are moving toward utf-8 (which is
probably the same encoding used in the sample you supplied). You would
end up with a less than useful rule, then.

You may instead want to guess the *language* used. Textcat is the answer
if you are looking for this.
But please note that its algorithm takes a statistical approach to the
language detection problem: it often detects a text as being in more
than one language, especially when the sample is too short and/or when
it is too "polluted" with foreign or (intentionally) mistyped words.

Regards,
Giampaolo
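
P.S. To show why a statistical detector can answer with more than one
language, here is a rough Python sketch of the general n-gram rank-order
idea. It is not the TextCat plugin's actual code: the function names, the
two toy training sentences, and the "tolerance" factor are all made up
for illustration, and real profiles are built from far larger samples.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Rank the character n-grams (length 1..max_n) of a text, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Break frequency ties alphabetically so the ranking is deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:top]


def out_of_place(sample_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile get a fixed penalty."""
    penalty = len(lang_profile)
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(sample_profile)
    )


def guess_languages(text, profiles, tolerance=1.05):
    """Return every language whose distance is within `tolerance` of the best one."""
    sample = ngram_profile(text)
    scores = {lang: out_of_place(sample, prof) for lang, prof in profiles.items()}
    best = min(scores.values())
    return sorted(lang for lang, score in scores.items() if score <= best * tolerance)


# Toy training data (hypothetical, far too small for real use).
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "it": "la volpe veloce salta sopra il cane pigro e poi il cane dorme al sole",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

if __name__ == "__main__":
    print(guess_languages(TRAINING["en"], PROFILES))
    print(guess_languages("", PROFILES))
```

The second call is the degenerate case: an empty sample has no n-grams,
so every profile ties at distance zero and all languages come back. Short
or deliberately misspelled texts behave in a milder version of the same
way, because too few n-grams are left to separate the candidates.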