On Thu, Apr 15, 2010 at 12:27:47PM +0200, Stevan Bajić wrote:
> On Thu, 15 Apr 2010 17:35:43 +0800
> Michael Alger <ds...@mm.quex.org> wrote:
> 
> [...]
> 
> Learning really only happens if you tell DSPAM that a message
> needs to be reclassified or corpusfed, or when using TEFT
> (regardless of TL) or TUM (only if TL < 2500).
> 
> But in order to do the Bayesian math, DSPAM also needs to know
> how many messages it has seen in total. So it is logical that it
> needs to keep track of that by updating the dspam_stats table and
> incrementing "spam_classified" and/or "innocent_classified".

Thanks. That makes sense. Also thanks for the other explanations of
the statistical theory behind it all, which makes things a lot
clearer for me as well.
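
Just to check my understanding of why those totals matter: a
per-token hit count only means something relative to how many spam
and innocent messages have been seen overall, which I gather is
roughly the kind of Graham-style calculation DSPAM's default mode
does. A rough Python sketch of my own (not DSPAM code):

    # My own sketch of why dspam_stats totals matter for a
    # Graham-style token probability (not DSPAM's actual code).
    def token_spam_probability(spam_hits, innocent_hits,
                               total_spam, total_innocent):
        # Hit counts are only meaningful as frequencies, i.e.
        # relative to the total spam/innocent messages seen.
        spam_freq = spam_hits / total_spam if total_spam else 0.0
        innocent_freq = innocent_hits / total_innocent if total_innocent else 0.0
        if spam_freq + innocent_freq == 0:
            return 0.5   # token never seen before: no information
        return spam_freq / (spam_freq + innocent_freq)

    # e.g. a token seen in 30 of 1000 spams and 3 of 2000 innocents:
    # 0.03 / (0.03 + 0.0015) = ~0.95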

> > I think saying "TOE is totally different from {NOTRAIN, TEFT,
> > TUM}" is a little strong. It seems to me that TEFT and TOE are
> > quite different, while TUM is a combination of the two: TEFT
> > until it has enough data, and then TOE. Or have I misunderstood?
> 
> Yes, you have misunderstood. TUM and TEFT could possibly learn
> something wrong, while TOE only learns something when you tell it
> to learn. TUM and TEFT learn by themselves: they FIRST learn and
> then depend on you to FIX errors. TOE does not do that; TOE only
> learns when you want it to learn.

Okay, now I get why you see a big difference between the modes.
Since I live in a perfect fantasy world where all classification
errors are corrected, I wasn't seeing the significance. :)
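
So if I've got it right, the automatic side of it boils down to
something like this (a sketch of my own reading, not DSPAM's actual
code; reclassified and corpusfed messages are trained in every mode):

    # When does a mode auto-train on a freshly classified message?
    # Names and structure are mine, based on your description.
    TL_MATURITY = 2500   # the TL threshold you mentioned for TUM

    def auto_trains(mode, total_learned):
        if mode == "TEFT":
            return True                         # always, regardless of TL
        if mode == "TUM":
            return total_learned < TL_MATURITY  # only while still immature
        # TOE and NOTRAIN never train automatically; they only learn
        # when you explicitly reclassify (or corpus-feed) a message.
        return False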

I was browsing through the README looking to see if dspam had any
nice hooks for helping to build my own corpus, and came across this:

  tum: Train-until-Mature.  This training mode is a hybrid
       between the other two training modes and provides a great
       balance between volatility and static metadata.

So apparently I'm not the only one who sees TUM as something of a
combination of TEFT and TOE. However, the explanation of TUM in the
README doesn't mention TL as affecting whether or not it learns. Is
this out of date?

The explanation (abridged from the README version):

  TuM will train on a per-token basis only tokens which have had
  fewer than 50 "hits" on them, unless an error is being retrained
  in which case all tokens are trained.

  NOTE: You should corpus train before using tum.

suggests to me that it actually learns a little differently from
TEFT (and without regard to TL), in that tokens which already have
50 or more hits on them will be ignored.
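
In other words, something like this (again my own Python sketch of
how I read the README, not the actual implementation):

    # Per-token rule from the README as I understand it: tokens with
    # 50+ hits are left alone unless an error is being retrained.
    TUM_TOKEN_HIT_LIMIT = 50

    def tum_trains_token(token_hit_count, retraining_error=False):
        if retraining_error:
            return True   # an error being retrained updates every token
        return token_hit_count < TUM_TOKEN_HIT_LIMIT

    # So during normal TUM operation a token with 120 hits is skipped,
    # but it still gets adjusted if I reclassify a message containing it.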

Thanks again for all your explanations.
