Tom Allison wrote:
[...]
If that was the case then why would I consider the wheel since someone
might have stumbled on that one too...
I was working on a few assumptions:
a token is a representation of essentially a regex match in either case,
CRM114 or Bayes.
"a token is a representation of essentially a regex match ...": utter
spout. Did you ever study statistics? I did, as part of my business
economics course. It was the only branch of math that ever captured my
imagination and made me want to do more.
Any overlap is purely coincidental.
What overlap?
How you manipulate the tokens, based on history, is dependent upon the
method of calculation, markov/chi-square/naive, but they are dependent
on the same base history of good/bad messages and good/bad tokens.
So a signature can consist of both naive derived tokens and SBPH derived
tokens.
Any learning or correction of that token will be to apply a correction
to the historical count (+1/-1) in either case. So the data and its
history remain consistent.
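
To make the "+1/-1 on a shared history" idea concrete, here is a minimal
sketch. It is not dspam's or CRM114's actual storage or API: the
TokenHistory class, its method names and the whitespace tokenizer are all
hypothetical, chosen only to show per-token spam/ham counts and how a
correction moves one count from the wrong class to the right one.

```python
from collections import defaultdict


class TokenHistory:
    """Hypothetical per-token spam/ham counters (not dspam's real schema)."""

    def __init__(self):
        self.spam_hits = defaultdict(int)
        self.ham_hits = defaultdict(int)

    def learn(self, tokens, is_spam):
        """Record one message: +1 per distinct token in the chosen class."""
        target = self.spam_hits if is_spam else self.ham_hits
        for tok in set(tokens):
            target[tok] += 1

    def correct(self, tokens, now_spam):
        """Undo a wrong verdict: -1 in the wrong class, +1 in the right one."""
        wrong = self.ham_hits if now_spam else self.spam_hits
        right = self.spam_hits if now_spam else self.ham_hits
        for tok in set(tokens):
            if wrong[tok] > 0:  # never let a count go negative
                wrong[tok] -= 1
            right[tok] += 1


def tokenize(text):
    """Assumed tokenizer: lowercase, split on whitespace."""
    return text.lower().split()


# A message first learned as ham, then corrected to spam by the user.
history = TokenHistory()
msg = tokenize("cheap pills cheap")
history.learn(msg, is_spam=False)
history.correct(msg, now_spam=True)
```

Whatever calculation is layered on top (naive, chi-square, Markov), it
would read the same two counters in this sketch; only the tokenizer that
produces the keys would differ.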
The more variations you can deploy in checking for spam the better the
chances that something will get trapped.
I'm happy with the proven 99.26% accuracy after 91,000+ messages, with
0.45% false positives, with which dspam is bountifully benefiting my high
school site (and without much participation on my part or that of my
demonstrably mostly idle, ignorant and stupid users). That comes after a
couple of years of messing around with SpamAssassin, constantly spending
hours twiddling hundreds of knobs to get half the accuracy (98%), and
without the chance to give my users (see above) their democratic right to
correct mistakes.
How would your users correct their mistakes with your mixture?
The biggest advantage that dspam can provide is a lighter-weight naive
or chi-square determination, removing some of the more obvious spam via
quarantine, followed by the slower CRM114 methodology to further
classify what's left over from the Bayes determination.
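
The cascade described above can be sketched roughly as follows. This is
only an illustration under stated assumptions: fast_score and slow_score
are hypothetical callables returning a spam probability between 0 and 1,
standing in for the cheap Bayes/chi-square pass and the slower CRM114
pass, and the cutoff values are made up, not tuned.

```python
def classify_two_stage(message, fast_score, slow_score,
                       obvious_spam=0.95, slow_cutoff=0.5):
    """Return (action, stage) for one message.

    The cheap classifier quarantines only the obvious spam; everything
    else is handed on to the slower classifier for the final decision.
    """
    if fast_score(message) >= obvious_spam:
        return ("quarantine", "fast")
    if slow_score(message) >= slow_cutoff:
        return ("quarantine", "slow")
    return ("deliver", "slow")


# Stub scorers, purely for illustration.
def demo_fast(message):
    return 0.99 if "viagra" in message else 0.40


def demo_slow(message):
    return 0.80 if "wire transfer" in message else 0.10


results = [
    classify_two_stage("buy viagra now", demo_fast, demo_slow),
    classify_two_stage("urgent wire transfer needed", demo_fast, demo_slow),
    classify_two_stage("lunch on friday?", demo_fast, demo_slow),
]
```

Note that in this sketch the second stage never sees what the first stage
quarantined, so any correction of a first-stage mistake would have to be
fed back to the first stage's own token history.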
It probably won't work because there just isn't enough data captured
about the tokens.
As I wrote, I'm satisfied with 99.26% accuracy after 91,000+ messages
etc. My site's Postfix 2.3 server is refusing (empirically) well over
98% of all potential spam, with around 0.1% false positives, before it
ever gets to dspam. Try concentrating on that.
> But if it was truly a bad idea then why do so many
> people use multiple filters to capture spam?
Do they? Is recycling the same message base repeatedly through the same
badly configured filter using "multiple filters"? If you want to use
multiple filters, then use multiple filters.
--Tonni
--
Tony Earnshaw
Email: tonni at hetnet.nl