On Monday 11 Apr 2005 21:07, Justin Mason wrote:
> Tim Meadowcroft writes:
> > I've noticed that a fair proportion of what gets thru my filters (and
> > just about the only spam that gets past the gmail spam filters on my
> > account so far) is foreign language encoded spam.
> >
> > I suppose this isn't hitting all the SpamAssassin hand-made keywords or
> > the Bayesian filters, and while I'm no SA expert I couldn't see a simple
> > config to turn it on
>
> "ok_languages" is what you're after ;)

Sorry, I should have said: I already have ok_languages set to "en fr" to accept just English and French. My understanding of that option, though, is that it guesses the language by looking at the text content rather than reading the encoding, and it doesn't seem to catch the Russian etc. in these cases.

Spurred on by your reminder, I see there's also "ok_locales", which deals with character encodings, so that may be a better bet (although the docs also say that "all ISO-8859-* character sets, and Windows code page character sets, are always permitted by default"). I'll give it a try with just "en".

The more general point is: how good are content-based filters at recognising other languages so far? Word breaking, stemming and the like for Bayesian filters are not so straightforward once you start to move away from Western European languages.

Cheers
--
Tim
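
P.S. For reference, a minimal sketch of the two settings discussed above. The values are the ones mentioned in this message. The file location (a site-wide local.cf or a per-user user_prefs) is an assumption, as is the note that ok_languages needs the language-guessing TextCat plugin loaded on SpamAssassin 3.x:

```
# Assumed location: local.cf or ~/.spamassassin/user_prefs

# Languages guessed from the message text; anything else counts against the mail
# (assumes the TextCat plugin is loaded)
ok_languages en fr

# Character-set locales considered acceptable; "en" covers Western character sets
ok_locales en
```

Whether ok_locales actually catches the messages in question is exactly what remains to be tested, so treat this as a starting point rather than a known fix.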
