Tom Allison wrote:

[...]

If that were the case, then why would I reinvent the wheel, since someone might have stumbled on that one too...

I was working on a few assumptions:

a token is a representation of essentially a regex match in either case, CRM114 or Bayes.
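A minimal sketch of that assumption: in both filter styles, a "token" is at bottom a substring picked out of the message by a pattern match. The regex below is purely illustrative; it is not dspam's or CRM114's actual tokenizer.

```python
import re

# Illustrative token pattern: word-ish runs, keeping characters that
# spam filters often find informative ($, !, ', ., -). An assumption,
# not the real tokenizer of either program.
TOKEN_RE = re.compile(r"[A-Za-z0-9$!'.-]+")

def tokenize(message: str) -> list[str]:
    """Lowercase the message and return every regex match as a token."""
    return TOKEN_RE.findall(message.lower())

print(tokenize("Get rich QUICK! Visit example.com"))
```

Either classifier could then count these same tokens; only the combining math differs.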

"a token is a representation of essentially a regex match ...": utter spout. Did you ever study statistics? I did, as part of my business economy course. It was the only branch of math that ever captured my imagination and made me want to do more.

Any overlap is purely coincidental.

What overlap?

How you manipulate the tokens, based on history, depends on the method of calculation (Markov, chi-square, or naive Bayes), but all of these methods depend on the same base history of good/bad messages and good/bad tokens.
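To illustrate the point: two different combination rules can read the very same per-token good/bad history. The counts, corpus sizes, and formulas below are made-up illustrations (a Graham-style spamicity, a naive product, and a Fisher-style chi-square sum), not the actual dspam or CRM114 math.

```python
import math

# Shared history: token -> (times seen in ham, times seen in spam).
# Numbers are invented for the example.
history = {
    "viagra":  (1, 40),
    "meeting": (30, 2),
    "free":    (10, 25),
}
n_ham, n_spam = 100, 100   # assumed corpus sizes

def spamicity(token: str) -> float:
    """Per-token spam probability from the shared counts."""
    ham, spam = history[token]
    p_s, p_h = spam / n_spam, ham / n_ham
    return p_s / (p_s + p_h)

def naive_combined(tokens: list[str]) -> float:
    """Naive-Bayes-style combination: multiply token probabilities."""
    p = math.prod(spamicity(t) for t in tokens)
    q = math.prod(1 - spamicity(t) for t in tokens)
    return p / (p + q)

def fisher_combined(tokens: list[str]) -> float:
    """Fisher/chi-square-style statistic: -2 * sum of log probabilities.
    Lower values here indicate stronger spam evidence."""
    return -2 * sum(math.log(spamicity(t)) for t in tokens)
```

Both functions consume the identical `history` table; swapping the calculation never touches the stored counts.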

So a signature can consist of both naive-derived tokens and SBPH-derived tokens. Any learning or correction of a token applies a correction to the historical count (+1/-1) in either case. So the data and its history remain consistent.
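The +1/-1 correction described above can be sketched as a pair of counter updates; the names and structure here are illustrative, not dspam's actual storage schema.

```python
# Shared token history: token -> [ham_count, spam_count].
counts: dict[str, list[int]] = {}

def learn(tokens: list[str], as_spam: bool, delta: int = 1) -> None:
    """Apply a +1 (or delta) correction to each token's historical count."""
    idx = 1 if as_spam else 0
    for t in set(tokens):
        counts.setdefault(t, [0, 0])[idx] += delta

def unlearn(tokens: list[str], as_spam: bool) -> None:
    """Correcting a misclassification is the same update with delta=-1."""
    learn(tokens, as_spam, delta=-1)

learn(["free", "offer"], as_spam=True)    # user reports spam
unlearn(["free", "offer"], as_spam=True)  # user reverses the mistake
```

Because training and untraining are symmetric count adjustments, any classifier reading the history afterwards sees consistent data.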

The more variations you can deploy in checking for spam the better the chances that something will get trapped.

I'm happy with the proven 99.26% accuracy over 91,000+ messages, with 0.45% false positives, from which dspam is bountifully benefiting my high school site, and without much participation on my part or that of my provenly mostly idle, ignorant and stupid users. That, after a couple of years of shooting around with SpamAssassin, constantly spending hours twiddling hundreds of knobs to get half of the accuracy (98%), and never having the chance to give my users (see above) their democratic right to correct mistakes.

How would your users correct their mistakes with your mixture?

The biggest advantage that dspam can provide is a lighter-weight naive or chi-square determination, removing some of the more obvious spam via quarantine, followed by the slower CRM114 methodology to further classify what's left over from the Bayes determination.
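The staged setup proposed above can be sketched as a simple pipeline: a cheap first pass quarantines obvious cases, and only the borderline leftovers pay for the expensive second pass. Both stage functions here are trivial stand-ins, not the real dspam or CRM114 classifiers.

```python
from enum import Enum

class Verdict(Enum):
    SPAM = "spam"
    HAM = "ham"
    UNSURE = "unsure"

def cheap_pass(message: str) -> Verdict:
    """Stand-in for a lightweight naive/chi-square check.
    Uses a toy '$'-density score and returns UNSURE when borderline."""
    score = message.count("$") / max(len(message.split()), 1)
    if score > 0.5:
        return Verdict.SPAM
    if score == 0:
        return Verdict.HAM
    return Verdict.UNSURE

def slow_pass(message: str) -> Verdict:
    """Stand-in for the heavier CRM114-style determination."""
    return Verdict.SPAM if "viagra" in message.lower() else Verdict.HAM

def classify(message: str) -> Verdict:
    first = cheap_pass(message)
    if first is not Verdict.UNSURE:
        return first           # obvious cases never reach the slow stage
    return slow_pass(message)  # only leftovers incur the expensive check
```

The design point is the early exit: the expensive classifier only ever sees what the cheap one could not decide.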

It probably won't work because there just isn't enough data captured about the tokens.

As I wrote, I'm satisfied with 99.26% accuracy after 91,000+ messages etc. My site's Postfix 2.3 server is refusing (empirically) well over 98% of all potential spam, with around 0.1% false positives, before it ever gets to dspam. Try concentrating on that.

> But if it was truly a bad idea then why do so many
> people use multiple filters to capture spam?

Do they? Is recycling the same message base repeatedly through the same badly configured filter using "multiple filters"? If you want to use multiple filters, then use multiple filters.

--Tonni

--
Tony Earnshaw
Email: tonni at hetnet.nl
