On Monday 11 Apr 2005 21:07, Justin Mason wrote:
> Tim Meadowcroft writes:
> > I've noticed that a fair proportion of what gets thru my filters (and
> > just about the only spam that gets past the gmail spam filters on my
> > account so far) is foreign language encoded spam.
> >
> > I suppose this isn't hitting all the SpamAssassin hand-made keywords or
> > the Bayesian filters, and while I'm no SA expert I couldn't see a simple
> > config to turn it on
>
> "ok_languages" is what you're after ;)

Sorry, I should have said: I already have ok_languages set to "en fr" to
accept just English and French. My understanding of that config option,
though, is that it guesses the language from the text content rather than
reading the declared encoding, and it doesn't seem to catch the Russian etc.
in these cases.

But spurred on by your reminder, I see there's also "ok_locales", which talks
about character encodings, so this may be a better bet. The docs do also say
that "all ISO-8859-* character sets, and Windows code page character sets,
are always permitted by default", but I'll give it a try with just "en".
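
To make the "reading the encoding" idea concrete, here is a minimal Python
sketch that looks only at the charset declared in a message's MIME headers
and ignores the text itself. It is not how SpamAssassin implements
ok_locales; the allow-list and the function names are assumptions made up
for illustration.

```python
from email import message_from_string

# Hypothetical allow-list for illustration only; not SpamAssassin's rule set.
ALLOWED_CHARSETS = {"us-ascii", "utf-8", "iso-8859-1", "iso-8859-15",
                    "windows-1252"}


def declared_charsets(raw_message):
    """Return the charsets declared in the Content-Type header of each part."""
    msg = message_from_string(raw_message)
    charsets = set()
    for part in msg.walk():  # walk() also yields the top-level message
        charset = part.get_content_charset()  # lower-cased name, or None
        if charset:
            charsets.add(charset)
    return charsets


def has_unwanted_charset(raw_message):
    """True if any part declares a charset outside the allow-list."""
    return any(cs not in ALLOWED_CHARSETS
               for cs in declared_charsets(raw_message))


RUSSIAN_SAMPLE = (
    "From: a@example.com\n"
    "Subject: test\n"
    'Content-Type: text/plain; charset="KOI8-R"\n'
    "\n"
    "body\n"
)

WESTERN_SAMPLE = (
    "From: b@example.com\n"
    "Subject: test\n"
    'Content-Type: text/plain; charset="ISO-8859-1"\n'
    "\n"
    "body\n"
)
```

A check like this depends entirely on what the sender declares, so a message
that labels its body with a permitted charset would pass regardless of the
language it is written in.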

The more general point is: how good are content-based filters at recognising
other languages so far? Word breaking, stemming and the like for Bayesian
filters are not so straightforward once you start to move away from Western
European languages.
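
As a rough illustration of the word-breaking problem, the Python sketch below
compares a naive split-on-non-word-characters tokenizer with overlapping
character bigrams. The bigram approach is one commonly used workaround for
text without spaces between words; it is an assumption added here for
illustration, not something any particular filter is claimed to do, and the
function names are hypothetical.

```python
import re


def word_tokens(text):
    """Split on runs of non-word characters, as a naive tokenizer might."""
    return [tok for tok in re.split(r"\W+", text) if tok]


def char_ngrams(text, n=2):
    """Overlapping character n-grams within each whitespace-delimited chunk."""
    grams = []
    for chunk in text.split():
        if len(chunk) < n:
            grams.append(chunk)  # keep short chunks whole
        else:
            grams.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return grams


english = "buy cheap pills"
japanese = "今日は良い天気です"  # no spaces between words

english_words = word_tokens(english)    # three separate tokens
japanese_words = word_tokens(japanese)  # the whole sentence as one token
japanese_bigrams = char_ngrams(japanese)
```

With the English sample the naive tokenizer yields one token per word, while
the Japanese sentence comes back as a single token that is unlikely to recur
in another message. The bigrams at least give a Bayesian filter repeatable
units to count.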

Cheers

--
Tim
