I like DSPAM a lot.  I have been using it for several years with only one
detected false positive (when I got a real eBay account).  I like it so much
that I am trying to use it in a new way, as a trained "interesting" ranker for
an RSS reader.  It is not doing a very good job at this, for several reasons.
I'd be tempted to write my own from scratch, but so much is already done,
token parsing and persistent storage being the main pieces.  It may be that
simply using another algorithm is the answer.  Is there a layman's description
of the available algorithms anywhere?  Or even a suggestion for a different
algorithm?

The problems, AFAICT, are:

* Bayesian CLASSIFICATION, i.e., a binary spam/ham decision.  I need a
  ranking, e.g. "this is 79% interesting", with a reasonably smooth and
  well-populated range of scores from interesting to uninteresting.

* Formatting included in scoring, e.g. HTML tags and fragments.  3.8.0 is much
  better in this regard than the 3.6.X I was previously using.  I have a
  workaround: a 4-line patch to ignore a token if both innocent_hits and
  spam_hits are zero, plus some utility scripts to zero out both
  dspam_token_data.*_hits columns for user/admin-specified tokens.

* Bias against false positives.  I think this can be solved by using the
  processorBias attribute (remember, I am using DSPAM for spam filtering too,
  so I can't set this in dspam.conf).  I think this is a new feature in 3.8.0,
  and it is very welcome.
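
To make the first two points concrete, here is a rough Python sketch of the
kind of scoring I have in mind.  It is not DSPAM code: the function names, the
counts dictionary, the 0.01/0.99 clamp and the choice to average per-token
log-odds are all my own assumptions for illustration.  It skips tokens whose
hit counts are both zero and returns a graded score instead of a yes/no
answer.

```python
import math


def token_probability(interesting_hits, boring_hits,
                      total_interesting, total_boring):
    """Per-token probability of 'interesting', clamped away from 0 and 1.

    The clamp values are an arbitrary assumption, chosen so that a single
    token can never force the final score to exactly 0 or 1.
    """
    i = interesting_hits / max(total_interesting, 1)
    b = boring_hits / max(total_boring, 1)
    p = i / (i + b)
    return min(max(p, 0.01), 0.99)


def interest_score(tokens, counts, total_interesting, total_boring):
    """Return a score in (0, 1); 0.5 means 'no evidence either way'.

    counts maps token -> (interesting_hits, boring_hits).  This is a
    hypothetical in-memory stand-in for the two *_hits columns.
    """
    log_odds = 0.0
    used = 0
    for tok in tokens:
        hits_i, hits_b = counts.get(tok, (0, 0))
        if hits_i == 0 and hits_b == 0:
            # Unknown or zeroed-out token (e.g. an HTML fragment): ignore it.
            continue
        p = token_probability(hits_i, hits_b, total_interesting, total_boring)
        log_odds += math.log(p) - math.log(1.0 - p)
        used += 1
    if used == 0:
        return 0.5
    # Averaging the log-odds (instead of summing them) keeps long articles
    # from saturating at 0 or 1, which spreads the scores over the range.
    return 1.0 / (1.0 + math.exp(-log_odds / used))


if __name__ == "__main__":
    counts = {"linux": (8, 2), "celebrity": (1, 9), "<div>": (0, 0)}
    score = interest_score(["linux", "celebrity", "<div>"], counts, 10, 10)
    print(f"{score:.0%} interesting")
```

Multiplying the result by 100 gives the "N% interesting" figure, and sorting
feed items by it gives the ranking.  Whether averaging is the right way to
combine tokens is exactly the kind of thing I'd like advice on.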

TIA,
   Jeffrey
