On Sun, 15 Feb 2004, Stroller uttered the following immortal words,

> As you know, "Bayesian" spam-filters depend upon "learning" the
> spamminess of emails from the content of already determined messages.
> If the training database is too small then results will not be of a
> predictable accuracy.
Correct.

> "A Plan For Spam" article, he states that he used databases of 4000
> each of ham & spam email, and if you subscribe to the Bogofilter

Well, obviously you have researched this subject in detail, and it's a
pleasure to discuss it with you. Several points have to be noted:

1. It is better if the 4000 spam messages are your own spam, i.e. not
ones downloaded from spamarchive.com or anywhere else, since the headers
are taken into account as well as the type of spam; in my region, for
instance, we tend to receive a lot of Korean junk. It is therefore a
problem to lay your hands on such a large volume of spam beforehand; one
option would be to collect the initial volume using SpamAssassin.

2. Which gives better results: feeding in a large spam and ham archive
at once, or starting from scratch, letting the filter make the
decisions, and correcting it appropriately? I thought that starting from
a clean slate, letting the filter make its choices, and correcting them
as it goes would give better results.

3. Equal volumes of spam and ham. I personally find that this tends to
let in more spam than when the filter is initially trained with, say, a
2:1 ratio of spam:ham. So I prefer to train it with more spam.

Grendel.
-- 
Grendel's annoyance filter is so advanced it puts people to the killfile
even before they have posted.
-- 
[EMAIL PROTECTED] mailing list
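P.S. For anyone curious what the "learning" actually computes: the
scheme Graham describes in "A Plan for Spam" can be sketched roughly as
below. This is a minimal illustration only, not Bogofilter's actual
implementation; the corpus sizes and token counts are made-up numbers.

```python
# Rough sketch of Graham-style Bayesian token scoring ("A Plan for Spam").
# Not Bogofilter's real code; counts below are toy numbers for illustration.

def token_prob(bad_count, good_count, nbad, ngood):
    """Per-token spam probability, per Graham's formula.
    Good counts are doubled to bias the filter against false positives."""
    g = 2 * good_count
    b = bad_count
    if g == 0 and b == 0:
        return 0.4  # Graham's default probability for never-seen tokens
    p = (b / nbad) / ((g / ngood) + (b / nbad))
    # Clamp to avoid certainty from a single token
    return max(0.01, min(0.99, p))

def combine(probs):
    """Combine the per-token probabilities of a message's most
    'interesting' tokens into one overall spam probability."""
    prod_p = 1.0
    prod_q = 1.0
    for p in probs:
        prod_p *= p
        prod_q *= 1.0 - p
    return prod_p / (prod_p + prod_q)

# Example: a token seen 200 times in 4000 spams but only 5 times in
# 4000 hams scores as very spammy (roughly 0.95 after clamping).
p = token_prob(200, 5, nbad=4000, ngood=4000)
```

This also shows why the training ratio matters: every per-token score is
normalised by nbad and ngood, so a skewed corpus shifts every
probability the filter computes.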
