On Sun, 15 Feb 2004, Stroller uttered the following immortal words,

> As you know, "Bayesian" spam-filters depend upon "learning" the 
> spamminess of emails from the content of already determined messages. 
> If the training database is too small then results will not be of a 
> predictable accuracy.

Correct.

> "A Plan For Spam Article", he states that he used databases of 4000 
> each of ham & spam email, and if you subscribe to the Bogofilter 

Well, you have clearly researched this subject in detail, and it's a 
pleasure to discuss it with you.

Several points should be noted:

1. It is better if the 4000 spam messages are your own, i.e. not ones
downloaded from spamarchive.com or anywhere else, since the headers are
taken into account as well as the type of spam; in my region, for instance,
we tend to receive a lot of Korean junk. It is therefore hard to lay your
hands on such a large volume of spam beforehand; one option would be to
collect it initially using SpamAssassin.
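To illustrate that bootstrap idea: SpamAssassin adds an X-Spam-Flag header to messages it marks, so a tagged mailbox can be split into an initial spam/ham corpus. A minimal sketch (the function name and mailbox path are my own, for illustration):

```python
# Sketch: bootstrap a training corpus from a mailbox that SpamAssassin
# has already tagged. SpamAssassin sets "X-Spam-Flag: YES" on messages
# it judges to be spam; everything else is treated as ham here.
import mailbox

def partition_by_spam_flag(mbox_path):
    """Split an mbox into (spam, ham) lists using SpamAssassin's verdict."""
    spam, ham = [], []
    for msg in mailbox.mbox(mbox_path):
        if msg.get("X-Spam-Flag", "").strip().upper() == "YES":
            spam.append(msg)
        else:
            ham.append(msg)
    return spam, ham
```

Once split, the two piles can be fed to the filter's registration commands (e.g. bogofilter's -s and -n options) as the initial corpus.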

2. Which gives better results: feeding in a large spam and ham archive at 
once, or starting from scratch, letting the filter make the decisions, and 
correcting it as appropriate? My impression is that starting from a clean 
slate and correcting the filter's choices as it makes them gives better 
results.
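The start-from-scratch workflow is essentially "train on error": only messages the filter gets wrong are registered. A toy sketch of that loop (the word-count classifier here is my own simplification, not bogofilter's actual model):

```python
# Sketch of "train on error": register a message only when the filter
# misclassifies it. The classifier itself is a deliberately crude
# word-count model, just to show the correction loop.
from collections import Counter

class ToyFilter:
    def __init__(self):
        self.spam_words = Counter()
        self.ham_words = Counter()

    def predicts_spam(self, words):
        # Crude spamminess: more hits against spam vocabulary than ham.
        s = sum(self.spam_words[w] for w in words)
        h = sum(self.ham_words[w] for w in words)
        return s > h

    def train(self, words, is_spam):
        (self.spam_words if is_spam else self.ham_words).update(words)

def train_on_error(flt, corpus):
    """corpus: list of (words, is_spam). Train only on mistakes."""
    mistakes = 0
    for words, is_spam in corpus:
        if flt.predicts_spam(words) != is_spam:
            flt.train(words, is_spam)
            mistakes += 1
    return mistakes
```

The bulk-feed alternative would simply call train() on every message regardless of the filter's verdict; the debate above is which of the two yields the better database.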

3. Equal volumes of spam and ham. I personally find that this tends to let 
in more spam than when the filter is initially trained with, say, a 
2:1 ratio of spam:ham. So I prefer to train it with more spam.
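The corpus ratio matters because the per-token probability in Graham's "A Plan for Spam" is normalised by the sizes of the two corpora. A sketch of his published formula shows where nbad and ngood enter (the function name is mine; the constants are Graham's):

```python
# Graham's per-token spam probability from "A Plan for Spam".
# b, g: the token's occurrence counts in the spam and ham corpora;
# nbad, ngood: the number of spam and ham messages trained on.
def token_spam_prob(b, g, nbad, ngood):
    g = 2 * g                      # Graham doubles ham counts to bias toward ham
    bad = min(1.0, b / nbad)
    good = min(1.0, g / ngood)
    if bad + good == 0:
        return 0.4                 # Graham's default for never-seen tokens
    return max(0.01, min(0.99, bad / (bad + good)))
```

Because b and g are divided by their own corpus sizes, shrinking the ham corpus relative to the spam corpus shifts token probabilities, which is one way to think about why a 2:1 training ratio behaves differently from 1:1.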

Grendel.

-- 
Grendel's annoyance filter is so advanced it puts people in the killfile 
even before they have posted. 
