Spam filter training stuff

Andrew Tarr Wed, 20 Aug 2003 23:54:20 -0700

Tim Wright writes:
 > 
 > so, we had a brief discussion on training spam filters the other day, and
 > about whether you can train a filter on someone else's spam data.
 > 
 > I borrowed Chris's spam data, and trained my spam filter (spamoracle) on
 > that. I also trained it on a couple of thousand of my own messages (good
 > ones --- so the filter learns to tell the difference).
 > 
 > Happily, all my spam is now being correctly diverted to my spam folder.
 > Guess this means it's OK to use a database of known spams as long as you
 > use lots of your own email for the good examples.
 >


Well, it depends on how different your legit mail is from the other
person's spam corpus, and how similar the other person's spam corpus
is to your spam. For most people in NZ I'd imagine that one person's
spam is pretty similar to another's, and that both are quite different
from their normal mail, so it will probably work OK. Increasing your
sample size is probably going to give a big pay-off, especially if you
don't have a big sample yourself, and that should far outweigh the
disadvantage from the small disparity between your spam and theirs.
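
One rough way to check the "how similar" question before borrowing a
corpus is to compare vocabularies. This is only a sketch: the three
sample strings are made up for illustration, the tokenizer is
deliberately crude, and Jaccard overlap is my own stand-in measure, not
anything spamoracle does internally.

```python
import re

def tokens(text):
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a, b):
    """Size of the intersection over size of the union of two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical sample text standing in for real corpora.
borrowed_spam = tokens("cheap pills buy now free offer")
my_spam = tokens("buy cheap pills free offer today")
my_ham = tokens("meeting notes for the filter discussion on wednesday")

# High overlap with my spam and low overlap with my legit mail is the
# situation where borrowing someone else's spam should work well.
spam_overlap = jaccard(borrowed_spam, my_spam)
ham_overlap = jaccard(borrowed_spam, my_ham)
```

If spam_overlap comes out much higher than ham_overlap, the borrowed
corpus looks like your spam and not like your real mail, which is the
case where it should help.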

For some reason, I get a lot of Brazilian spam, which I gather is
somewhat unusual, so I'd imagine someone else's all-English spam won't
help me much with the stuff in Portuguese. And if you get legitimate
mail in Portuguese and no Brazilian spam, then my spam won't help you
much (and you may end up with naughty false positives as a result).

I'm going to start trying out Bayesian filters soon. I have a small
concern that they might just turn into Portuguese recognizers with my
stuff.
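
To make that concern concrete, here is a toy word-count Bayes scorer.
It is a minimal sketch under my own assumptions (add-one smoothing,
equal priors, unknown words ignored, made-up training text written
without accents), not how spamoracle or any particular filter actually
works. All the spam is Portuguese and all the good mail is English, so
the only thing it can learn is the language.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def train(messages):
    """Count how often each token appears across a list of messages."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts

def spam_log_odds(text, spam_counts, ham_counts):
    """Sum of per-token log(P(tok|spam) / P(tok|ham)) with add-one smoothing.

    Positive means "looks like spam". Tokens never seen in training are
    skipped, and equal priors are assumed.
    """
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values()) + len(vocab)
    ham_total = sum(ham_counts.values()) + len(vocab)
    score = 0.0
    for tok in tokenize(text):
        if tok not in vocab:
            continue
        p_spam = (spam_counts[tok] + 1) / spam_total
        p_ham = (ham_counts[tok] + 1) / ham_total
        score += math.log(p_spam / p_ham)
    return score

# Hypothetical training data: Portuguese-only spam, English-only good mail.
spam_counts = train([
    "compre agora oferta gratis para voce",
    "oferta especial para voce compre agora",
])
ham_counts = train([
    "meeting notes for the spam filter discussion",
    "can you send the filter notes for wednesday",
])

# A legitimate message that happens to be in Portuguese scores as spam,
# while an ordinary English message scores as good mail.
portuguese_ham_score = spam_log_odds("reuniao para voce agora", spam_counts, ham_counts)
english_ham_score = spam_log_odds("notes for the meeting", spam_counts, ham_counts)
```

With this training set the Portuguese message gets a positive score
purely because of common words like "para" and "voce", which is the
"Portuguese recognizer" failure. Some good mail in Portuguese in the
training set would be the obvious thing to try against it.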

A.
