On Mon, 16 Feb 2004, Stroller uttered the following immortal words,
> When the filter analyses an incoming message it judges its sum
> "bogosity" [1] by giving each word a percentage rating, based on the
> number of times that word has been encountered in ham:spam (plus a
> default score for new, unknown words). A product of the 10 "most
> extreme" words in the message (IE: those closest to either 0% or 100%)
> is calculated, so that several very hammy words (>90% say) will cause
> this product to remain high, but only one or two very spammy words
> (<15%) will significantly decrease it.

Unfortunately, spammers are wising up to this trick. That is why you get
a spam with the standard advertisement for penis enlargement and, below
that, about 2000 genuine non-spam words like "elephant", "linux" and so
on. The padding is there to trick the Bayesian filter into thinking that
this is a genuine email. I find that a lot of emails of this type tend
to fool Bayesian filters :(

Other methods used against Bayesian filters are malformed words like
pen!s, [EMAIL PROTECTED]@ etc.

> So, IIRC the total probable spamicity of the message is (in Graham's
> original algorithm) the inverse of the product just arrived at. I don't
> see why it should matter when the word tokens used to calculate this
> spamicity are collected. If you train the database with several
> thousand messages over a period of several months, it should result in
> the same database as you'd achieve by creating a new database afresh
> with those same messages.

Yes, I concur with you on this point.

> That's probably because the author(s) of your spam filter think that
> it's better to let in a few spam (false negatives) than to forget your
> wedding anniversary because your filter has misclassified a message
> from your wife (false positive). What I think you may [3] be doing here
> is simply retuning your filter's heuristics - if the word "bellicose"
> occurs once in 1000 spam & once in every 1000 ham, by training with
> 2000 spam & 1000 ham, one might [3] fool a simplistic statistical
> spam-filter into thinking that "Ah, I have received 2 spam emails
> containing this word, and only one ham, so it must be a spammy word"
> [3]. If this is the case, there may be better ways of tuning your
> filter, but if you can comprehend
> <http://www.bgl.nu/bogofilter/tuning.html> then you're a better man
> than I.

Taking a look.

Grendel

-- 
Grendel's annoyance filter is so advanced it puts people to the killfile
even before they have posted.
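For anyone who wants to poke at the numbers, here is a rough Python sketch of the two points discussed above: the "bellicose" training-ratio effect and the padding trick. It is not bogofilter's code. The function names, the 0.01/0.99 clamp and the example probabilities are my own assumptions for illustration. The combining step uses the formula from Graham's "A Plan for Spam" as I remember it, with each word scored as a spam probability (the mirror image of the ham percentage Stroller describes), and it keeps the 10 most extreme words mentioned in the quoted text.

```python
from math import prod


def naive_spam_prob(spam_count, ham_count):
    """Simplistic rating: raw counts only, ignoring corpus sizes."""
    return spam_count / (spam_count + ham_count)


def token_spam_prob(spam_count, ham_count, n_spam, n_ham):
    """Rating normalised by how many spam and ham messages were trained."""
    spam_freq = spam_count / n_spam
    ham_freq = ham_count / n_ham
    return spam_freq / (spam_freq + ham_freq)


def clamp(p, low=0.01, high=0.99):
    """Keep ratings away from 0 and 1 so the products never reach zero."""
    return max(low, min(high, p))


def combine(probs, n=10):
    """Combine the n most extreme per-word spam ratings into one score."""
    # "Most extreme" means furthest from the neutral rating of 0.5.
    extreme = sorted((clamp(p) for p in probs),
                     key=lambda p: abs(p - 0.5), reverse=True)[:n]
    spam = prod(extreme)
    ham = prod(1 - p for p in extreme)
    return spam / (spam + ham)


# "bellicose": once per 1000 spam and once per 1000 ham,
# trained with 2000 spam and 1000 ham, so counts are 2 and 1.
naive = naive_spam_prob(2, 1)
normalised = token_spam_prob(2, 1, 2000, 1000)

# Padding trick: two very spammy words, then 2000 very hammy ones.
advert_only = combine([0.99, 0.99])
padded = combine([0.99, 0.99] + [0.01] * 2000)

print(naive, normalised, advert_only, padded)
```

With raw counts "bellicose" looks spammy, while dividing by the number of messages trained brings it back to neutral. In the padded message the hammy filler crowds the two spammy words out of the top 10, so the combined score lands on the ham side.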
