-------- Original Message --------
> Date: Tue, 4 Dec 2007 14:20:17 -0600
> From: Jeffrey Taylor <[EMAIL PROTECTED]>
> To: [email protected]
> Subject: [dspam-users] SOT: algorithm explanations

Hello Jeffrey

> I like DSPAM a lot. I have been using it for several years with only one
> detected false positive (when I got a real eBay account). I like it so
> much that I am trying to use it in a new way, as a trained "interesting"
> ranker for an RSS reader. It is not doing a very good job at this. There
> are several reasons for this. I'd be tempted to write my own from scratch,
> but so much is already done, token parsing and persistent storage being
> the main ones. It may be that simply using another algorithm is the
> answer. Is there a layman's description of them anywhere? Or even a
> suggestion for a different algorithm?
>
> The problems, AFAICT, are:
>
> * Bayesian CLASSIFICATION, i.e., a binary spam/ham decision. I need a
>   ranking, e.g. "this is 79% interesting": a reasonably smooth and
>   well-populated range of interesting to uninteresting scores.
>
> * Formatting included in scoring, e.g. HTML tags and fragments. 3.8.0 is
>   much better in this regard than the 3.6.X I was previously using. I have
>   a way around this: a 4-line patch to ignore a token if both
>   innocent_hits and spam_hits are zero, and some utility scripts to double
>   zero out dspam_token_data.*_hits for user/admin-specified tokens.
>
> * Bias against false positives. I think this can be solved by using the
>   processorBias attribute (remember, I am using DSPAM for spam filtering
>   too, so I can't use dspam.conf). I think this is a new feature in 3.8.0
>   and very welcome.

I think you would be better served by CRM114 for this kind of task.
Something like its hyperspace algorithm comes to mind when reading your
requirements.

> TIA,
> Jeffrey

Steve
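
To make the first two requirements in the quoted message concrete (a continuous score instead of a binary decision, and skipping tokens whose hit counts are both zero), here is a minimal Python sketch. It is not DSPAM's or CRM114's algorithm. It assumes a plain per-token Bayesian estimate, and it averages the token log-odds rather than summing them, which is an illustrative choice to keep scores from piling up at 0 and 1. The function names, the `counts` dictionary, and the 0.01/0.99 clamp are all hypothetical.

```python
import math


def token_probability(innocent_hits, spam_hits, total_innocent, total_spam):
    """Estimate P(interesting | token) from hit counts.

    Here "innocent" stands for "interesting" and "spam" for "uninteresting".
    Returns None when both hit counts are zero, so the caller can ignore
    the token (the behaviour the 4-line patch is described as having).
    """
    if innocent_hits == 0 and spam_hits == 0:
        return None
    good = innocent_hits / max(total_innocent, 1)
    bad = spam_hits / max(total_spam, 1)
    p = good / (good + bad)
    # Clamp so a single token can never force the score to exactly 0 or 1.
    return min(0.99, max(0.01, p))


def interest_score(tokens, counts, total_innocent, total_spam):
    """Return a score in (0, 1); 0.5 means "no evidence either way".

    counts maps token -> (innocent_hits, spam_hits). Tokens that are
    unknown or have zero hits on both sides do not contribute.
    """
    log_odds = []
    for token in set(tokens):
        hits = counts.get(token)
        if hits is None:
            continue
        p = token_probability(hits[0], hits[1], total_innocent, total_spam)
        if p is None:
            continue
        log_odds.append(math.log(p) - math.log(1.0 - p))
    if not log_odds:
        return 0.5
    # Assumption: average the log-odds instead of summing them, which
    # spreads scores over the range rather than saturating them.
    mean = sum(log_odds) / len(log_odds)
    return 1.0 / (1.0 + math.exp(-mean))


if __name__ == "__main__":
    # Made-up counts for illustration only.
    counts = {"python": (8, 2), "viagra": (0, 10), "<div>": (0, 0)}
    for item in (["python", "<div>"], ["viagra"], ["python", "viagra"]):
        print(item, round(interest_score(item, counts, 10, 10), 3))
```

Sorting feed items by this score gives a ranking rather than a yes/no verdict. A bias toward one side could be approximated by weighting `good` or `bad` before computing `p`, but whether that matches what processorBias does in DSPAM is not something this sketch claims.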
