Re: [IMail Forum] AntiSpam Statistical Updates

R. Scott Perry Mon, 02 Aug 2004 13:35:59 -0700

We've just upgraded to v8.12 from v7.15. Love the new anti-spam
functionality, but see a major benefit in being able to update the
statistics in the word lists. I'm trying to work out the best way of
handling this, as it would appear that, short of adding a word manually,
you have to have the email in a folder.

It sounds like you're trying to break statistics further than they have already been broken. :)

Bayesian filtering is based on statistics, with one major deviation: it makes one incorrect assumption, which makes the statistics invalid but still useful. However, if you say "The word 'evil' should indicate spam 98% of the time instead of 68% of the time", you're wrong -- if it's 68%, it's 68%. Changing it to a made-up value (98%) will cause the statistics to deviate further from what they should be.

If you've ever seen a 100% chance of spam (or 1.000, depending on how it is written), you've seen the statistical flaw in Bayesian filtering. It's impossible to have a 100% or 0% chance of spam. Making up your own values just makes the calculations less accurate.
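To illustrate why a 1.000 score is a warning sign, here is a minimal Python sketch of one common way per-word probabilities are combined (the formula popularized by Paul Graham's "A Plan for Spam"). Whether IMail combines scores this way is an assumption; the function name combine and the sample values are made up for illustration.

```python
from math import prod

def combine(word_probs):
    """Combine per-word spam probabilities, treating the words as independent.

    Illustrative only: this is not taken from IMail's implementation.
    """
    spam = prod(word_probs)                  # evidence that the message is spam
    ham = prod(1.0 - p for p in word_probs)  # evidence that it is legitimate
    return spam / (spam + ham)

# Hand-editing one word from 0.68 to 0.98 moves the whole message score
# from roughly 0.58 to roughly 0.97, with no new data behind the change.
print(combine([0.68, 0.6, 0.3]))
print(combine([0.98, 0.6, 0.3]))

# A single word stored as 1.0 zeroes the "ham" product, so the message
# scores 1.0 no matter what the other words say.
print(combine([1.0, 0.2, 0.3]))
```

Note that a 1.0 and a 0.0 in the same message would make this sketch divide by zero, which is another reason such values should not be in the word list at all.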

It also leads me on to how the statistical analysis learns. What logic does
it use to class a word as spam? If it sees the word "tree" for the first
time the default is that it only has a 40% chance of being spam.

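That's a flaw of Bayesian filtering as it is used here -- it isn't a 40% chance of spam. If you've seen something occur just once, it's impossible to make any valid statistical assumptions about it. (Strictly speaking, the "naive" in "naive Bayesian filtering" refers to a different shortcut, the assumption that words occur independently of one another, but a made-up default for a barely seen word is just as unsupported by the data.)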
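One way some filters soften this problem is to blend an assumed starting value with the observed data, weighted by how often the word has actually been seen (the approach described by Gary Robinson). The sketch below assumes the 40% default from the question as the starting value and a hypothetical strength of 1; it is not necessarily what IMail does.

```python
def smoothed_probability(observed, times_seen, prior=0.4, strength=1.0):
    """Blend an assumed prior with the observed spam rate for a word.

    observed   -- raw spam probability computed from the training data
    times_seen -- number of training messages that contained the word
    prior      -- value assumed for a word never seen before (assumption: 0.4)
    strength   -- how many observations the prior is "worth" (assumption: 1)
    """
    return (strength * prior + times_seen * observed) / (strength + times_seen)

print(smoothed_probability(0.0, 0))   # unseen word: falls back to the prior
print(smoothed_probability(1.0, 1))   # seen once in spam: only a modest shift
print(smoothed_probability(1.0, 30))  # seen 30 times in spam: close to, but never, 1.0
```

The point of the design is that one sighting barely moves the number, while many sightings let the data dominate the made-up starting value.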

What happens if it's then used in lots of spam, but the statistical analysis
doesn't work it out (although say a DNS BL does) to change the rating of
that word?

It should. If the engine is trained properly, when new spam comes in, the engine's database will get updated. If it gets 1,000 spams submitted, 30 of which contain the word "tree", and 1,000 legitimate E-mails, none of which contain it, it will update the database so that the word "tree" is very likely to indicate that the E-mail is spam.
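The 1,000/1,000 example can be worked through with a small Python sketch of a trainable word database. The class and method names (WordStats, train, spam_probability) are hypothetical, and the ratio used is a simple textbook estimate rather than IMail's actual calculation.

```python
from collections import Counter

class WordStats:
    """Tracks how many spam and legitimate messages contain each word."""

    def __init__(self):
        self.spam_msgs = 0
        self.ham_msgs = 0
        self.spam_counts = Counter()
        self.ham_counts = Counter()

    def train(self, words, is_spam):
        unique = set(words)  # count each word once per message
        if is_spam:
            self.spam_msgs += 1
            self.spam_counts.update(unique)
        else:
            self.ham_msgs += 1
            self.ham_counts.update(unique)

    def spam_probability(self, word):
        if self.spam_msgs == 0 or self.ham_msgs == 0:
            return None  # needs both spam and legitimate mail to say anything
        spam_rate = self.spam_counts[word] / self.spam_msgs
        ham_rate = self.ham_counts[word] / self.ham_msgs
        if spam_rate + ham_rate == 0:
            return None  # word never seen in training
        return spam_rate / (spam_rate + ham_rate)

stats = WordStats()
for i in range(1000):
    # 30 of the 1,000 spams contain "tree"; none of the legitimate mail does.
    stats.train(["tree", "offer"] if i < 30 else ["offer"], is_spam=True)
    stats.train(["meeting"], is_spam=False)

print(stats.spam_probability("tree"))
```

With these counts the raw ratio for "tree" comes out as 1.0, which is exactly the kind of impossible certainty described earlier in this post. A single legitimate message containing "tree" is enough to pull it back below 1.0.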

Of course, a naive Bayesian filtering system must be trained for each and every user (with *both* spam *and* legitimate E-mail) in order for it to be as effective as it was designed to be. That's one of the major drawbacks of naive Bayesian filtering on a mailserver.

-Scott
---
Declude JunkMail: The advanced anti-spam solution for IMail mailservers since 2000.
Declude Virus: Ultra reliable virus detection and the leader in mailserver vulnerability detection.
Find out what you've been missing: Ask for a free 30-day evaluation.

---
[This E-mail was scanned for viruses by Declude Virus (http://www.declude.com)]


To Unsubscribe: http://www.ipswitch.com/support/mailing-lists.html
List Archive: http://www.mail-archive.com/imail_forum%40list.ipswitch.com/
Knowledge Base/FAQ: http://www.ipswitch.com/support/IMail/
