I'm running a second bayesian filter - using spamprobe - but I'm not feeding the entire message into it. I'm only feeding in the headers - not the body of the message. I do scan the body, but all I feed in from it are any links and email addresses it contains. I'll probably include phone numbers as well - but that's all. I'm also using Exim to "enhance" the headers: I look up DNS information on the sender's domain - their MX servers, the zone information on the connecting IP address, and a few other things - so the headers have a lot more info to work with.
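The header enrichment can be done in an Exim DATA ACL with dnsdb lookups - something along these lines (a sketch, not my exact config; the header names here are made up for illustration):

```
# acl_check_data sketch: attach DNS info as extra headers
# so the bayesian filter has more to chew on.
warn  add_header = X-Sender-MX: ${lookup dnsdb{mx=$sender_address_domain}{$value}{unresolved}}
warn  add_header = X-Connect-PTR: ${lookup dnsdb{ptr=$sender_host_address}{$value}{unresolved}}
```

Each added header becomes more tokens for the filter, and since spammers' DNS setups tend to look different from legitimate senders', those tokens carry real signal.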
And - it's working EXTREMELY WELL.
Why - you might ask - does it work better with less information?
Different parts of the message are spammier than other parts. The most spammy part of the message is the message headers, especially the subject line, and the URLs that it links to. Generally spam isn't sent the same way that ham is sent and the bayesian filter can catch that. So what I'm doing is only looking at the hottest parts of the email and disregarding most of the body.
One of the immediate advantages is that messages padded with random text to confuse bayesian filters have no effect on this one. And if someone gets a spam and forwards it to me - it's not going to score very high. It works so well that the rest of you developers should really look into this and do it right.
So - you may ask - how did I implement this?
I'm using Exim and Spam Assassin, with Spamprobe as the second bayesian database. Spamprobe is simple to implement and interface with. What I do is take the messages coming out of Spam Assassin and look for the autolearn tags, so that Spamprobe is trained on the same messages that Spam Assassin is trained on. I have IMAP feedback folders as well, so users can drag spam into spam-missed folders; I pick those up and train SA and Spamprobe on those too.
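That training hook amounts to something like this (my actual glue is different; this sketch assumes Spam Assassin's usual `autolearn=` token in X-Spam-Status, and uses spamprobe's standard `spam`/`good` training commands):

```python
import re
import subprocess

def autolearn_verdict(spam_status_header):
    """Pull the autolearn result ('spam', 'ham', 'no', ...) out of
    an X-Spam-Status header value, or None if it isn't there."""
    m = re.search(r'autolearn=(\w+)', spam_status_header)
    return m.group(1) if m else None

def train_spamprobe(message_path, verdict):
    """Train spamprobe the same way Spam Assassin just trained itself.
    'spamprobe spam FILE' and 'spamprobe good FILE' are its
    train-as-spam / train-as-ham invocations."""
    if verdict == 'spam':
        subprocess.run(['spamprobe', 'spam', message_path], check=True)
    elif verdict == 'ham':
        subprocess.run(['spamprobe', 'good', message_path], check=True)
    # autolearn=no or missing: train on nothing
```

The point is just that both databases see the exact same training stream, so any accuracy difference comes from what each filter looks at, not what it was trained on.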
Messages going into Spamprobe are first run through a perl script that removes the message body except for email addresses and links. So Spamprobe is trained on the same messages - but only part of each message.
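The stripping step looks roughly like this (my actual script is Perl; this Python version, with illustrative regexes, shows the idea):

```python
import email
import re

URL_RE = re.compile(r'https?://[^\s"\'<>]+', re.IGNORECASE)
EMAIL_RE = re.compile(r'[\w.+-]+@[\w.-]+\.\w+')

def headers_and_hot_parts(raw_message):
    """Keep the full headers, but replace the body with only the
    URLs and email addresses found in it - everything else is dropped."""
    msg = email.message_from_string(raw_message)
    body = ''
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            payload = part.get_payload(decode=True)
            if payload:
                body += payload.decode('utf-8', errors='replace')
    keep = URL_RE.findall(body) + EMAIL_RE.findall(body)
    header_text = ''.join(f'{k}: {v}\n' for k, v in msg.items())
    return header_text + '\n' + '\n'.join(keep) + '\n'
```

Whatever random padding text the spammer stuffed into the body never reaches the filter - only the headers and the hot tokens do.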
New email coming in is first tested with Spamprobe to see how it scores. Again - only the headers and links are tested. Spamprobe returns a number between 0 and 1, with 0 = ham and 1 = spam. I pipe the result into another perl script that returns a header with one of 9 different words describing the result. The middle 50% (0.25-0.75) is neutral. The next 15% on each side (0.10-0.25 and 0.75-0.90) is low. The next 9% (0.01-0.10 and 0.90-0.99) is high. The next 0.9% is very - and the last 0.1% is extreme.
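The bucketing boils down to this (cut-offs as described above; the tag names match the scores below, and the neutral tag's name is my invention here since it scores 0 anyway):

```python
def spamprobe_tag(p):
    """Map a spamprobe probability (0 = ham, 1 = spam) onto one of
    nine tags: a 50% neutral middle, then low / high / very / extreme
    bands of 15%, 9%, 0.9% and 0.1% on each side."""
    bands = [
        (0.001, 'SP_HAM_EXTREME'),
        (0.01,  'SP_HAM_VERY'),
        (0.10,  'SP_HAM_HIGH'),
        (0.25,  'SP_HAM_LOW'),
        (0.75,  'SP_NEUTRAL'),
        (0.90,  'SP_SPAM_LOW'),
        (0.99,  'SP_SPAM_HIGH'),
        (0.999, 'SP_SPAM_VERY'),
    ]
    for limit, tag in bands:
        if p < limit:
            return tag
    return 'SP_SPAM_EXTREME'
```

The tag then just gets written into a header for Spam Assassin to match on.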
These words are added through a header on the way into spam assassin and spam assassin scores them. I've assigned scores as follows:
score SP_HAM_EXTREME -8
score SP_HAM_VERY -5
score SP_HAM_HIGH -2
score SP_HAM_LOW -1
score SP_SPAM_LOW 1
score SP_SPAM_HIGH 2
score SP_SPAM_VERY 5
score SP_SPAM_EXTREME 8
What I am seeing is that although Spam Assassin's bayesian filter is pretty good, it's not as good as the Spamprobe filter when Spamprobe is fed only the hottest parts of the message. The way I cobbled this together probably isn't the best way to do it - but it's good enough to show that the concept works. It's working so well that my overall accuracy - which includes all the tricks I'm using - is now almost 100%.
I'm still tweaking this, but I'm happy to share with anyone interested what I'm doing and how I'm doing it. And I want to encourage everyone to look into this idea of using partial messages in bayesian filtering. I'm running both filters now; I don't know yet if both are necessary in the long run. I like the idea of two filters looking at different data. It makes me wonder about having multiple filters all looking at different parts of the message independently and then scoring them all separately.
