On Feb 16, 2004, at 12:02 am, Grendel wrote:

> [...] "A Plan For Spam" article, he states that he used databases of 4000
> each of ham & spam email, and if you subscribe to the Bogofilter [...]

> Well, obviously you have researched this subject in detail, and it's a pleasure to discuss this with you.

You flatter me, Grendel.
I just like to understand how things work, is all.
Perhaps it's just that I remember where I read things - I had to Google for those figures, but they were easy to find.


> 1. It is better if the 4000 spam are your own spam, i.e. not ones
> downloaded from spamarchive.com or anywhere else...

Indeed. I was saving spam for several months whilst procrastinating over installing Bogofilter, but deleted them during a disk-space crisis. The following week, when I finally got around to it, I had only 100 to train with!


But I think Ralph made a good compromise - 4000 of someone else's spam is probably better than only 100 of your own. I try to save all my spam, just in case my database ever gets corrupted (I keep meaning to tar.gz them). If one were short of spam at the time of installing a statistical filter, one could train the original database with a mixture of "foreign" and self-harvested spam, retaining only one's own spam; later, once enough personalised spam has been acquired, one could delete the database & retrain afresh with that. I think this would allow both excellent initial filtering results (besides, what's the difference between 97% & 99%, really?) and a later tailored fit.

> 2. Which gives better results: feeding a large spam and ham archive at
> once, or starting from scratch, letting the filter make the decisions and
> correcting it appropriately? I thought that starting from a clean slate and
> letting the filter make its choices and correcting them would give better
> results.

I don't see that it makes any difference. I think one should work with what one has got.


When the filter analyses an incoming message it judges its overall "bogosity" [1] by giving each word a probability rating, based on the number of times that word has previously been encountered in spam versus ham (plus a default score for new, unknown words). The ratings of the 15 or so "most extreme" words in the message (i.e. those closest to either 0% or 100%) are then multiplied together [2], so that several very spammy words (>90%, say) will keep this product high, but even one or two very hammy words (<10%) will decrease it significantly.

So, IIRC, the total probable spamicity of the message is (in Graham's original algorithm) that product divided by the sum of itself and the corresponding product of the words' complements. I don't see why it should matter when the word tokens used to calculate this spamicity were collected. If you train the database with several thousand messages over a period of several months, it should result in the same database as you'd achieve by creating a new database afresh with those same messages. Of course, not many "Bayesian" filters are based directly upon Graham's original algorithm nowadays, and yours may see fit to weight more recent words more heavily (but I think it would probably be a clupea rubra for it to do so).
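For the curious, here's a minimal sketch in Python of the per-word rating and the combining step, following the formulas in Graham's "A Plan for Spam". The function names and numbers are mine, and bogofilter's real implementation differs in several respects - treat this as an illustration, not a reference:

```python
from functools import reduce

def word_prob(spam_count, ham_count, n_spam, n_ham):
    """Graham-style per-word spam probability. Ham occurrences are
    doubled to bias against false positives, and the result is
    clamped away from 0 and 1."""
    g = 2 * ham_count
    b = spam_count
    if g + b == 0:
        return 0.4  # Graham's default score for never-before-seen words
    p = min(1.0, b / n_spam) / (min(1.0, g / n_ham) + min(1.0, b / n_spam))
    return max(0.01, min(0.99, p))

def spamicity(probs, n_extreme=15):
    """Keep only the n ratings furthest from the neutral 0.5, then
    combine them as prod(p) / (prod(p) + prod(1 - p))."""
    extreme = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n_extreme]
    prod = reduce(lambda acc, p: acc * p, extreme, 1.0)
    comp = reduce(lambda acc, p: acc * (1.0 - p), extreme, 1.0)
    return prod / (prod + comp)
```

Note that nothing in either function depends on *when* a word's counts were accumulated - only on the totals - which is why I say the training order shouldn't matter.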

> 2. Equal volumes of spam and ham. I personally find that this tends to let
> in more spam than when the filter is initially trained with, say, a
> 2:1 ratio of spam:ham. So I prefer to train it with more spam.

That's probably because the author(s) of your spam filter think it better to let in a few spam (false negatives) than to have you forget your wedding anniversary because your filter has misclassified a message from your wife (false positive). What I think you may be doing here is simply retuning your filter's heuristics [3]: if the word "bellicose" occurs once in every 1000 spam & once in every 1000 ham, then by training with 2000 spam & 1000 ham one might fool a simplistic statistical spam-filter into thinking "Ah, I have received 2 spam emails containing this word, and only 1 ham, so it must be a spammy word". If this is the case, there may be better ways of tuning your filter, but if you can comprehend <http://www.bgl.nu/bogofilter/tuning.html> then you're a better man than I.
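To make that concrete, here's a toy comparison using the made-up "bellicose" numbers above. The function names are invented and neither is how bogofilter actually scores; the point is only that a filter scoring on raw counts is fooled by an unbalanced training corpus, whereas one normalising by corpus size is not:

```python
def naive_prob(spam_count, ham_count):
    # Scores on raw occurrence counts: fooled by an unbalanced corpus.
    return spam_count / (spam_count + ham_count)

def normalised_prob(spam_count, ham_count, n_spam, n_ham):
    # Scores on per-corpus frequencies, so the corpus sizes cancel out.
    s = spam_count / n_spam
    h = ham_count / n_ham
    return s / (s + h)

# "bellicose" occurs once per 1000 spam and once per 1000 ham, so
# training on 2000 spam & 1000 ham yields 2 spam hits vs 1 ham hit:
print(round(naive_prob(2, 1), 3))         # 0.667 - looks spammy
print(normalised_prob(2, 1, 2000, 1000))  # 0.5   - correctly neutral
```

So whether your 2:1 training actually shifts word scores, or merely shifts some overall prior, depends entirely on which of these your filter resembles internally.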


Hope this verbose mail is of interest,

Stroller.



[1] I shall use ESR's term here for the sake of brevity.
[2] Note to the mathematically challenged: 90% * 10% is NOT 900%, but 9%.
This can equally be represented as (0.9 * 0.1) != 9, but rather (0.9 * 0.1) = 0.09, and 0.09 is the same as saying 9%.
[3] But I have no idea how your spam-filter or mine works, as far as internals are concerned, so I may be mistaken.



-- [EMAIL PROTECTED] mailing list


