Tim Meadowcroft writes:
> On Monday 11 Apr 2005 21:07, Justin Mason wrote:
> > Tim Meadowcroft writes:
> > > I've noticed that a fair proportion of what gets thru my filters (and
> > > just about the only spam that gets past the gmail spam filters on my
> > > account so far) is foreign language encoded spam.
> > >
> > > I suppose this isn't hitting all the SpamAssassin hand-made keywords or
> > > the Bayesian filters, and while I'm no SA expert I couldn't see a simple
> > > config to turn it on
> >
> > "ok_languages" is what you're after ;)
>
> Sorry, I should have said, I already have ok_languages set to "en fr" to
> accept just english and french, but my understanding of that config option is
> that it guesses the language looking at the text content rather than reading
> the encoding, and it doesn't seem to catch the russian etc. in these cases.

hmm, that's disappointing :(

> But spurred on by your reminder, I see there's also "ok_locales" which talks
> about character encodings, so this may be a better bet (but the docs also say
> that "all ISO-8859-* character sets, and Windows code page character sets,
> are always permitted by default") but I'll give it a try with just "en".
>
> The more general point is, how good are content based filters at recognising
> other languages so far ? Word breaking and stemming and the like for Bayesian
> filters are not so straightforward once you start to move away from Western
> European languages.

this is a tricky point. I think many implement the trick we use in
SpamAssassin -- simply breaking 8-bit "words" into 2-byte pairs... full
word breaking is hard for Asian languages particularly.

--j.
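To make the 2-byte-pair trick concrete, here is a minimal Python sketch of the idea. It is not SpamAssassin's actual code: the function name `tokenize`, the "8:" token prefix, and the rule that any byte >= 0x80 marks a token as 8-bit are all assumptions made for illustration.

```python
from typing import List


def tokenize(raw: bytes) -> List[bytes]:
    """Split raw message bytes into tokens for a Bayes-style classifier.

    Plain ASCII words are kept whole. Any word containing a high-bit
    byte is chopped into non-overlapping 2-byte chunks, so no
    language-specific word breaking is needed.
    """
    tokens: List[bytes] = []
    for word in raw.split():  # bytes.split() splits on ASCII whitespace
        if any(b >= 0x80 for b in word):
            # Hypothetical "8:" prefix keeps these chunks distinct from
            # ordinary ASCII tokens. A trailing odd byte becomes a
            # 1-byte chunk.
            for i in range(0, len(word), 2):
                tokens.append(b"8:" + word[i:i + 2])
        else:
            tokens.append(word)
    return tokens


if __name__ == "__main__":
    sample = b"buy " + "привет".encode("koi8_r")
    for tok in tokenize(sample):
        print(tok)
```

The pairs are deliberately dumb: they do not try to line up with character or word boundaries, they just give the classifier repeatable sub-word units to count, whatever the encoding.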
