Tim Wright writes:
>
> so, we had a brief discussion on training spam filters the other day, and
> about whether you can train a filter on someone else's spam data.
>
> I borrowed Chris's spam data, and trained my spam filter (spamoracle) on
> that. I also trained it on a couple of thousand of my own messages (good
> ones --- so the filter learns to tell the difference).
>
> Happily, all my spam is now being correctly diverted to my spam folder.
> Guess this means it's OK to use a database of known spams as long as you
> use lots of your own email for the good examples.
>

Well, it depends on how different your legit mail is from the other person's spam corpus, and how similar the other person's spam corpus is to your spam. For most people in NZ I'd imagine that one person's spam is pretty similar to another person's, and that both are quite different from their normal mail, so I'd imagine it will probably work OK.

Increasing your sample size is probably going to give a big pay-off, especially if you don't have a big sample yourself; that should far outweigh the disadvantage from the small disparity between your spam and theirs.

For some reason, I get a lot of Brazilian spam, which I gather is somewhat unusual, so I'd imagine someone else's all-English spam won't help me much with the stuff in Portuguese. And if you get legitimate mail in Portuguese and no Brazilian spam, then my spam won't help you much (and you may end up with naughty false positives as a result).

I'm going to start trying out Bayesian filters soon. I have a small concern that they might just turn into Portuguese recognizers with my stuff.

A.
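
P.S. To make the worry concrete, here is a minimal sketch of a toy word-count filter in the naive Bayes style. It is not spamoracle's actual algorithm, just an illustration under simplifying assumptions: whitespace tokens, add-one smoothing, and only the words present in a message contribute to the score. The class name `ToyBayesFilter`, the function names, and all the sample messages are made up for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase, split on whitespace.
    return text.lower().split()


class ToyBayesFilter:
    """Toy naive-Bayes-style scorer; hypothetical, not spamoracle."""

    def __init__(self):
        # Number of messages of each class that contain a given word.
        self.counts = {"spam": Counter(), "ham": Counter()}
        # Number of messages trained per class.
        self.totals = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.counts[label].update(set(tokenize(text)))
        self.totals[label] += 1

    def spam_score(self, text):
        # Log-odds that the message is spam; above 0 leans towards spam.
        score = math.log((self.totals["spam"] + 1) / (self.totals["ham"] + 1))
        for word in set(tokenize(text)):
            p_spam = (self.counts["spam"][word] + 1) / (self.totals["spam"] + 2)
            p_ham = (self.counts["ham"][word] + 1) / (self.totals["ham"] + 2)
            score += math.log(p_spam / p_ham)
        return score


# Made-up sample data: English ham, English spam, Portuguese spam.
my_ham = [
    "meeting notes attached for tomorrow",
    "lunch tomorrow at noon",
    "patch attached please review",
]
borrowed_english_spam = [
    "cheap pills buy now",
    "win money now click here",
    "buy cheap watches now",
]
portuguese_spam = [
    "compre agora oferta",
    "oferta agora clique aqui",
]


def build_filter(spam_corpus, ham_corpus):
    f = ToyBayesFilter()
    for msg in spam_corpus:
        f.train(msg, "spam")
    for msg in ham_corpus:
        f.train(msg, "ham")
    return f


# Case 1: someone else's all-English spam plus my own good mail.
borrowed = build_filter(borrowed_english_spam, my_ham)
score_english_spam = borrowed.spam_score("buy cheap pills now")
# Every word here is unseen, so the filter has no evidence either way.
score_unseen_portuguese = borrowed.spam_score("compre oferta imperdivel")

# Case 2: Portuguese spam plus English-only good mail.
recognizer = build_filter(portuguese_spam, my_ham)
# A legitimate Portuguese message that happens to share a common word.
score_legit_portuguese = recognizer.spam_score("reuniao agora obrigado")

print(score_english_spam, score_unseen_portuguese, score_legit_portuguese)
```

In the first case the borrowed corpus catches English spam but says nothing about Portuguese spam. In the second, a perfectly good Portuguese message leans towards spam just because it shares a common word with the spam corpus and nothing with the English good mail, which is the "Portuguese recognizer" effect.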
