-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Tim Meadowcroft writes:
> On Monday 11 Apr 2005 21:07, Justin Mason wrote:
> > Tim Meadowcroft writes:
> > > I've noticed that a fair proportion of what gets through my filters (and
> > > just about the only spam that gets past the gmail spam filters on my
> > > account so far) is foreign language encoded spam.
> > >
> > > I suppose this isn't hitting all the SpamAssassin hand-made keywords or
> > > the Bayesian filters, and while I'm no SA expert I couldn't see a simple
> > > config to turn it on
> >
> > "ok_languages" is what you're after ;)
> 
> Sorry, I should have said, I already have ok_languages set to "en fr" to 
> accept just English and French, but my understanding of that config option 
> is that it guesses the language by looking at the text content rather than 
> reading the encoding, and it doesn't seem to catch the Russian etc. in these 
> cases.

hmm, that's disappointing :(

> But spurred on by your reminder, I see there's also "ok_locales" which talks 
> about character encodings, so this may be a better bet (but the docs also say 
> that "all ISO-8859-* character sets, and Windows code page character sets, 
> are always permitted by default") but I'll give it a try with just "en".
> 
> The more general point is, how good are content-based filters at recognising 
> other languages so far? Word breaking and stemming and the like for Bayesian 
> filters are not so straightforward once you start to move away from Western 
> European languages.

this is a tricky point.  I think many filters implement the trick we use
in SpamAssassin -- simply breaking 8-bit "words" into 2-byte pairs...
full word breaking is hard, particularly for Asian languages.
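
As a rough illustration of that pair-splitting idea, here is a minimal
Python sketch. It is not SpamAssassin's actual code: the function names,
the "any byte >= 0x80" test for an 8-bit word, and the choice of
non-overlapping pairs are all assumptions made for the example.

```python
from typing import List


def split_8bit_pairs(word: bytes) -> List[bytes]:
    """Cut a byte string into consecutive, non-overlapping 2-byte chunks.

    An odd-length word leaves a single trailing byte as its last chunk.
    (Hypothetical helper, named for this example only.)
    """
    return [word[i:i + 2] for i in range(0, len(word), 2)]


def tokenize(line: bytes) -> List[bytes]:
    """Whitespace-split a raw line; break 8-bit words into byte pairs.

    Assumption: a word counts as "8-bit" if any byte has the high bit set.
    Plain ASCII words are kept whole.
    """
    tokens: List[bytes] = []
    for word in line.split():
        if any(b >= 0x80 for b in word):
            tokens.extend(split_8bit_pairs(word))
        else:
            tokens.append(word)
    return tokens
```

Calling tokenize() on a raw message line gives tokens that can be fed to a
Bayesian classifier without knowing anything about the language or where
its word boundaries fall, which is the appeal of the trick.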

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFCWvP6MJF5cimLx9ARAi3AAKCAQQg3Fyg8uChoKNyw093onb4/RQCgl6a1
SthD2TuYG3pZejA1WlKQBhM=
=Qx5N
-----END PGP SIGNATURE-----
