On Feb 15, 2004, at 7:18 pm, Grendel wrote:
There was an article about spam on freshmeat some time back; please take a
look at it and make a decision.
<http://freshmeat.net/articles/view/964/>
The training database used in this article was too small (c. 1200 messages).
As you know, "Bayesian" spam-filters depend upon "learning" the spamminess of emails from the content of previously classified messages. If the training database is too small, the results will not be of predictable accuracy.
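For what it's worth, the core computation these filters share is simple enough to sketch. Below is a minimal, illustrative Graham-style scorer in Python — the function names, the neutral default of 0.4, and the toy corpus are my own illustration, not taken from Bogofilter or any of the filters reviewed:

```python
# Sketch of Graham-style "Bayesian" spam scoring (illustrative only).
import math
from collections import Counter

def train(messages):
    """Count, per word, how many messages it appears in."""
    counts = Counter()
    for tokens in messages:
        counts.update(set(tokens))  # each word counted once per message
    return counts

def spam_probability(word, ham_counts, spam_counts, n_ham, n_spam):
    """Per-word spam probability, roughly as in 'A Plan for Spam'."""
    g = 2 * ham_counts[word]   # Graham doubles ham counts to bias against false positives
    b = spam_counts[word]
    if g + b < 1:
        return 0.4             # unseen words get a near-neutral default (assumption)
    p = (b / n_spam) / (g / n_ham + b / n_spam)
    return min(0.99, max(0.01, p))  # clamp away from 0 and 1

def score(tokens, ham_counts, spam_counts, n_ham, n_spam):
    """Combine per-word probabilities with the naive-Bayes product rule."""
    probs = [spam_probability(w, ham_counts, spam_counts, n_ham, n_spam)
             for w in set(tokens)]
    prod = math.prod(probs)
    inv = math.prod(1 - p for p in probs)
    return prod / (prod + inv)

# Tiny toy corpus -- orders of magnitude below the 4000+ messages discussed above,
# which is exactly why scores from it would mean little in practice.
ham = [["meeting", "tomorrow", "lunch"], ["project", "report", "tomorrow"]]
spam = [["viagra", "cheap", "offer"], ["cheap", "offer", "winner"]]
hc, sc = train(ham), train(spam)
print(score(["cheap", "offer", "tomorrow"], hc, sc, len(ham), len(spam)) > 0.5)
```

With only two messages per class, a single word flips the verdict — which is the small-database unreliability being discussed.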
The maintainers of Bogofilter recommend 5000 emails as a "good size", and their tuning utility is known to give fallacious results with fewer than a couple of thousand messages. In Paul Graham's original "A Plan for Spam" article, he states that he used databases of 4000 each of ham and spam, and if you subscribe to the Bogofilter mailing list you will see that test comparisons are regularly made using 20,000 emails.
This is not to say that those numbers of messages are required before Bogofilter, or any other statistical-analysis tool, becomes useful, but given the essential similarity of the statistical approach used by the filters reviewed, I would guess that empirical results cannot be relied upon with smaller databases.
In Graham's "Better Bayesian Filtering" article he quotes a paper presented in 1998 by Pantel and Lin which tends to confirm this - their filter "only caught 92% of spam, with 1.16% false positives". Graham remarks that 92% is a fairly poor success rate by the standards of his own results, and the first possible cause he suggests is the low volume of messages tested by Pantel and Lin - only c. 620 messages in total.
Stroller.
-- [EMAIL PROTECTED] mailing list
