> From: Jason Bertoch [mailto:ja...@i6ix.com]
> Sent: Wednesday, May 26, 2010 3:34 PM
> On 2010/05/25 7:02 PM, Karsten Bräckelmann wrote:
> > On Wed, 2010-05-26 at 10:35 +1200, Jason Haar wrote:
> >
> > Not as far as ok_locales and the respective CHARSET_FARAWAY rules are
> > concerned, IIRC. They have been written long ago to trigger on the
> > char-sets used. They don't detect the char-set based on the actual
> > payload.
> >
> 
> So where does that leave us?  With the need for an update or addition
> to
> the FARAWAY rules?  Also, what's the deal with normalize_charset?  Can
> that have any impact on these cases where language/locale isn't
> detected?

Jason, I may be completely wrong, but this is what I get grepping 
'normalize_charset' in 3.3.1:


Util/DependencyInfo.pm:  desc => 'If you plan to use the normalize_charset config setting to detect
Conf.pm:=item normalize_charset ( 0 | 1)        (default: 0)
Conf.pm:    setting => 'normalize_charset',
Conf.pm:          $self->{parser}->lint_warn("config: normalize_charset requires Perl 5.8.5 or later");
Conf.pm:          $self->{parser}->lint_warn("config: normalize_charset requires HTML::Parser 3.46 or later");
Conf.pm:          $self->{parser}->lint_warn("config: normalize_charset requires Encode::Detect");
Conf.pm:          $self->{parser}->lint_warn("config: normalize_charset requires Encode");
Conf.pm:      $self->{normalize_charset} = 1;


As you can see, {normalize_charset} can be set. But... where is it actually used, then?
 
It may be that it is used in a way I can't catch with grep, though...

Anyway, according to perldoc, normalize_charset would "allow detecting the 
character set" used in a text part (which I believe is what you are looking 
for) and then convert the text to Unicode.
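
If you want to try it, enabling it should just be a matter of one line in 
local.cf (a sketch, untested on my side; it assumes your Perl, HTML::Parser, 
Encode and Encode::Detect meet the version requirements shown by the lint 
warnings above):

# local.cf -- minimal sketch, untested
normalize_charset 1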

Now, to me the encoding *detection* phase is probably not much of an issue 
here, because a wrong encoding declared in the content's headers would impair 
readability of the spam text for the recipient, which is counter-productive 
for spammers. So the encoding actually used is probably always declared in 
the headers, and you can use it to score mail with foreign encodings right now.
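
For instance, something along these lines should already let you score on a 
declared charset (only a sketch; the rule name and the charset list are 
placeholders for whatever you never expect to receive legitimately):

# in a .pre file, if not already loaded there
loadplugin Mail::SpamAssassin::Plugin::MIMEHeader

# Match the charset declared in any MIME part's Content-Type header
mimeheader   FOREIGN_CHARSET  Content-Type =~ /charset="?(?:gb2312|big5|koi8-r)"?/i
describe     FOREIGN_CHARSET  MIME part declares a charset I never receive legitimately
score        FOREIGN_CHARSET  2.0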

I don't believe this is going to make much difference anyway, since nowadays 
most legit mail *and* spam are moving toward UTF-8 (which is probably the same 
encoding used in the sample you supplied). You would end up with a rule of 
limited usefulness, then.

You may instead want to guess the *language* used. TextCat is the answer if 
that's what you are looking for. But please note its algorithm is a statistical 
approach to the language detection problem: it often detects a text as being in 
more than one language, especially when the sample is too short and/or too 
"polluted" with foreign or (intentionally) misspelled words.

Regards,

Giampaolo
