* On 2002.08.22, in <[EMAIL PROTECTED]>,
*	"Daniel Buchmann" <[EMAIL PROTECTED]> wrote:
>
> Let's say 99.99% of all spam is in english (which is my experience), and
> my mother tongue is norwegian. ;)

Funny: my experience is that 60% of all spam is in Chinese, Spanish, or
Turkish. :)

> Let's also say that I usually never receive mails written in english.
>
> The Bayesian approach would then put all english words in a bad-words
> list (except words found in headers), and all norwegian words in a
> good-words list, wouldn't it?

Not quite. It would put the limited subset of English words which
appear in spam into a bad-words list. This isn't nit-picking: as soon
as you save one legitimate English message to your good list (copy one
from Usenet if you need to), the stats are weighted. Add another legit
one, and it's even smarter.

If you *never* receive legitimate English mail, this is not a problem
to you: all English truly is spam. But if you *sometimes* receive
English mail, add them all to your good list for a short while, and
you'll find that the filter figures it out.

> 1. What happens the day I join an english mailing list, or receive a
> mail written in english?

It *might* be marked as spam; it depends significantly on the headers,
and on any words shared between Norwegian and English.

> 2. What happens if I receive a mail written in norwegian but containing
> a few english words, i.e. quoting someone?

Nothing special. The volume of good Norwegian will make any English in
your message matter little. However, it's not necessarily likely that
these few English words will even be in your bad-words list.

> I'd say it would discard mail #1, but let through #2...
> What do you think..?

The moral is: don't activate a Bayesian filter and start filing all its
discoveries to the bitbucket. Watch it for a while, and when you're
comfortable, start saving its finds to a circular file. Check this file
occasionally, and file any false positives to your good-words list.
Also, meanwhile, file any missed spam to your bad-words list.

Why I say these things: I've been using my home-brewed system based on
this article for a few days now, and it's pretty sharp.
It's missed a few, but it's learning quickly. It initially had some
false positives: since I receive postmaster and abuse mail at my
domain, forwarded spam got flagged. But it's since learned to ignore
those, too, while flagging the actual spam messages contained in those
messages when they arrive separately.

And so far, it has only 94 spams and 99 non-spams in the database,
which together provide 21,000 text tokens I have data on. Graham's
article discusses having 4000 of each, IIRC. I expect even better
results as I approach that, but I'm letting it happen naturally at
this point.

(I'm certain that I could get the same results with fewer messages; I
started out by filing in a big pile of known messages, and I've been
fine-tuning since. There's lots of overlap in the statistical value
provided by these messages. At some point I'll clean out my database
completely and begin again, tabula rasa, to see how few messages it
takes to give me satisfactory results. But I probably won't try this
until I'm more or less done gnashing the software.)

In a mailing list server, you'd want identified spam to be redirected
for moderator approval rather than flung away. Once your databases are
fleshed out pretty well, you might be able to start rejecting messages
above some very high rating (say, 98%) and submitting for approval
those above something lower (say, 90%).

-- 
 -D.	We establised a fine coffee. What everybody can say	Sun Project, APC/UCCO
	TASTY! It's fresh, so-mild, with some special coffee's	University of Chicago
	bitter and sourtaste. "LET'S HAVE SUCH A COFFEE! NOW!"	[EMAIL PROTECTED]
	Please love CAFE MIAMI. Many thanks.

_______________________________________________
Mailman-Developers mailing list
[EMAIL PROTECTED]
http://mail.python.org/mailman-21/listinfo/mailman-developers
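
P.S. The two-threshold idea for a list server (reject above a very high
rating, hold for the moderator above a lower one) might look roughly
like the sketch below. This is only an illustration: the names `route`,
`REJECT_ABOVE`, and `MODERATE_ABOVE` are made up here, none of this is
Mailman's API, and it assumes the filter reports a spam rating between
0 and 1. The 98% and 90% figures are just the "say" values from the
message, not tuned numbers.

```python
# Hypothetical thresholds, taken from the "say, 98%" and "say, 90%"
# examples in the message; real values would need tuning per list.
REJECT_ABOVE = 0.98
MODERATE_ABOVE = 0.90


def route(spam_rating: float) -> str:
    """Decide what a list server does with a message.

    spam_rating is assumed to be the filter's estimate, in [0, 1],
    that the message is spam. Returns "reject", "moderate", or
    "deliver".
    """
    if not 0.0 <= spam_rating <= 1.0:
        raise ValueError("spam_rating must be between 0 and 1")
    if spam_rating > REJECT_ABOVE:
        return "reject"      # confident enough to refuse outright
    if spam_rating > MODERATE_ABOVE:
        return "moderate"    # hold for moderator approval
    return "deliver"         # pass through to the list


if __name__ == "__main__":
    for rating in (0.995, 0.93, 0.40):
        print(rating, route(rating))
```

Until the databases are well fleshed out, setting `REJECT_ABOVE` to 1.0
sends everything suspicious to the moderator and rejects nothing, which
matches the "watch it for a while" advice.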