RE: [IMail Forum] AntiSpam Statistical Updates

Glenn Smith Mon, 02 Aug 2004 14:03:02 -0700

Hi Scott

Thanks for the reply. I think I didn't explain myself very well. I guess the
question is how do I effectively make the system learn? Will it do it
automatically (how, what's the intelligence like), or will I need to provide
good and spam emails for it to work on?


In the past we've used POPFile locally, which we trained ourselves. It's been
very accurate for us, but since that accuracy comes from our own training, I
appreciate it may not be as accurate for others.

However, I'm keen to use the IMail analysis to its full potential. I know
you can give it a mailbox to learn from, but I was wondering if I could
export a list of spam and good messages from Outlook into a text file and
then call it folder.mbx for the antispamseeder.exe tool to work on.

Having searched the list, I see some people swear by ASSP. I guess there
seems no point in having detection if you cannot update it easily.

G.

-----Original Message-----
We've just upgraded to v8.12 from v7.15. Love the new anti-spam
functionality, but see a major benefit in being able to update the
statistics in the word lists. I'm trying to work out the best way of
handling this as it would appear short of adding a word manually, you have
to have the email in a folder.


It sounds like you're trying to break statistics further than they have
already been broken. :)

Bayesian filtering is based on statistics, with one major deviation (it
makes one incorrect assumption that leaves the statistics invalid, but
still useful). However, if you say "The word 'evil' should indicate spam 98%
of the time instead of 68% of the time", you're wrong -- if it's 68%, it's
68%. Changing it to a made-up value (98%) will cause the statistics to
deviate further from what they should be.

If you've ever seen a 100% chance of spam (or 1.000, depending on how it is
written), you've seen the statistical flaw in Bayesian filtering. It's
impossible to have a 100% or 0% chance of spam. Making up your own values
just makes the calculations less accurate.
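
To illustrate why a 1.000 score is a warning sign, here is a minimal Python
sketch. It assumes the product-style combining rule popularized by Paul
Graham's "A Plan for Spam"; whether IMail combines word scores this way is an
assumption, and the function name and sample values are made up for the
example.

```python
from math import prod

def combine(word_probs):
    """Combine per-word spam probabilities, treating words as independent.

    Assumed rule (not IMail's documented formula):
        P = prod(p) / (prod(p) + prod(1 - p))
    """
    spam = prod(word_probs)
    ham = prod(1.0 - p for p in word_probs)
    return spam / (spam + ham)

# Three words that each lean towards legitimate mail...
hammy = [0.2, 0.1, 0.3]
print(round(combine(hammy), 4))

# ...are completely overruled by one word pinned at 1.0, because the
# "not spam" product is multiplied by (1 - 1.0) = 0.
print(combine(hammy + [1.0]))
```

A single word forced to 1.0 (or 0.0) wipes out all other evidence in the
message, which is why hand-edited extreme values do more harm than good.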


It also leads me on to how the statistical analysis learns. What logic does
it use to class a word as spam? If it sees the word "tree" for the first
time the default is that it only has a 40% chance of being spam.


That's the flaw of Bayesian filtering (and why it is correctly referred to
as "naive Bayesian filtering") -- it isn't a 40% chance of spam. If you've
seen something occur just once, it's impossible to make any valid
statistical assumptions about it.
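
One common way filters soften this problem is to blend the observed rate with
a prior for unknown words, weighted by how many messages the observation is
based on (the adjustment Gary Robinson proposed and SpamBayes adopted). A
rough Python sketch follows. The 0.4 prior comes from the question above; the
prior strength of 1.0 and the function name are assumptions for illustration,
not IMail settings.

```python
def smoothed_prob(observed, n_seen, prior=0.4, strength=1.0):
    """Blend an observed spam rate with a prior for rarely seen words.

    observed: raw spam probability measured for the word
    n_seen:   number of training messages that contained the word
    prior:    assumed probability for a word never seen before
    strength: how many observations the prior is worth (assumed value)
    """
    return (strength * prior + n_seen * observed) / (strength + n_seen)

# A word seen only in spam: the estimate moves away from the prior
# slowly at first and only approaches 1.0 after many sightings.
for n in (0, 1, 10, 100):
    print(n, round(smoothed_prob(1.0, n), 3))
```

With these assumed values, one sighting in spam moves the word from 0.4 to
0.7 rather than straight to 1.0.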


What happens if it's then used in lots of spam, but the statistical analysis
doesn't work it out (although say a DNS BL does) to change the rating of
that word?


It should. If the engine is trained properly, when new spam comes in, the
engine's database will get updated. If it gets 1,000 spams submitted and 30
have the word tree in them, and 1,000 legitimate E-mails of which 0 have the
word tree in them, it will update the database so that the word tree is very
likely to indicate that the E-mail is spam.
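
The numbers above can be worked through in a short Python sketch. This is a
simplified version of the per-word estimate from Paul Graham's "A Plan for
Spam" (without his extra weighting of legitimate mail), not IMail's actual
calculation; the function name and the 0.01/0.99 clamp are assumptions for
illustration.

```python
def word_spam_prob(spam_hits, spam_total, ham_hits, ham_total,
                   floor=0.01, ceiling=0.99):
    """Estimate how strongly a word indicates spam from training counts.

    spam_hits / ham_hits:   messages of each kind containing the word
    spam_total / ham_total: messages of each kind in the training set
    The result is clamped so it can never reach exactly 0 or 1.
    """
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    if spam_rate + ham_rate == 0:
        return None  # word never seen in training; nothing to estimate
    raw = spam_rate / (spam_rate + ham_rate)
    return min(ceiling, max(floor, raw))

# "tree" appears in 30 of 1,000 spams and 0 of 1,000 legitimate messages.
print(word_spam_prob(30, 1000, 0, 1000))  # 0.99 (raw value 1.0, clamped)
```

The raw ratio here is 1.0, so the clamp is what keeps the stored value from
claiming certainty.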

Of course, a naive Bayesian filtering system must be trained for each and
every user (with *both* spam *and* legitimate E-mail) in order for it to be
as effective as it was designed to be. That's one of the major drawbacks of
naive Bayesian filtering on a mailserver.


-Scott


To Unsubscribe: http://www.ipswitch.com/support/mailing-lists.html
List Archive: http://www.mail-archive.com/imail_forum%40list.ipswitch.com/
Knowledge Base/FAQ: http://www.ipswitch.com/support/IMail/
