I like DSPAM a lot. I have been using it for several years with only one detected false positive (when I got a real eBay account). I like it so much that I am trying to use it in a new way: as a trained "interesting" ranker for an RSS reader. It is not doing a very good job at this, for several reasons. I'd be tempted to write my own from scratch, but so much is already done in DSPAM, token parsing and persistent storage being the main pieces. It may be that simply using another algorithm is the answer. Is there a layman's description of the available algorithms anywhere? Or even a suggestion for a different algorithm?
The problems, AFAICT, are:

* Bayesian CLASSIFICATION, i.e., a binary spam/ham decision. I need a ranking, e.g. "this is 79% interesting": a reasonably smooth and well-populated range of scores from interesting to uninteresting.
* Formatting is included in scoring, e.g. HTML tags and fragments. 3.8.0 is much better in this regard than the 3.6.X I was previously using. I have a way around this: a 4-line patch to ignore a token if both innocent_hits and spam_hits are zero, plus some utility scripts to zero out both dspam_token_data.*_hits columns for user/admin-specified tokens.
* Bias against false positives. I think this can be solved by using the processorBias attribute (remember, I am using DSPAM for spam filtering too, so I can't use dspam.conf for this). I think this is a new feature in 3.8.0, and a very welcome one.

TIA,
Jeffrey
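P.S. To illustrate the first two points, here is a rough Python sketch of a continuous "interestingness" score. It is not DSPAM code and does not use DSPAM's API. The function names, the `(good_hits, bad_hits)` layout (standing in for innocent_hits and spam_hits), and the choice of a Robinson-style geometric-mean combination are all assumptions made for the example. Tokens whose two counters are both zero are skipped, mirroring the patch described above.

```python
import math

def token_probability(good_hits, bad_hits, total_good, total_bad,
                      strength=1.0, prior=0.5):
    """Smoothed probability that an item containing this token is interesting.

    All parameter names are hypothetical; good_hits/bad_hits stand in for
    per-token counters such as innocent_hits/spam_hits.
    """
    # Relative frequency of the token in each trained corpus.
    freq_good = good_hits / max(total_good, 1)
    freq_bad = bad_hits / max(total_bad, 1)
    seen = good_hits + bad_hits
    raw = freq_good / (freq_good + freq_bad) if (freq_good + freq_bad) > 0 else prior
    # Pull rarely seen tokens toward the neutral prior, so the result
    # always stays strictly between 0 and 1.
    return (strength * prior + seen * raw) / (strength + seen)

def interest_score(tokens, counts, total_good, total_bad):
    """Return a score in (0, 1); 0.5 means "no evidence either way"."""
    probs = []
    for token in tokens:
        good_hits, bad_hits = counts.get(token, (0, 0))
        if good_hits == 0 and bad_hits == 0:
            continue  # ignore tokens with no training signal (e.g. zeroed-out markup)
        probs.append(token_probability(good_hits, bad_hits, total_good, total_bad))
    if not probs:
        return 0.5
    n = len(probs)
    # Geometric means computed in log space to avoid underflow.
    p_good = 1.0 - math.exp(sum(math.log(1.0 - p) for p in probs) / n)
    p_bad = 1.0 - math.exp(sum(math.log(p) for p in probs) / n)
    if p_good + p_bad == 0:
        return 0.5
    return (1.0 + (p_good - p_bad) / (p_good + p_bad)) / 2.0

# Made-up counts: token -> (good_hits, bad_hits), with 10 items trained per class.
counts = {"python": (8, 1), "sale": (1, 9), "<div>": (0, 0)}
items = {
    "a": ["python", "<div>"],
    "b": ["python", "sale"],
    "c": ["sale"],
}
ranked = sorted(items, key=lambda k: interest_score(items[k], counts, 10, 10),
                reverse=True)
```

Because the geometric mean averages the evidence rather than multiplying it, scores in this sketch tend to spread across the range instead of piling up near 0 and 1, which is the property a ranker needs. Whether any of DSPAM's built-in algorithms behaves this way is exactly what I am asking.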
