On Fri, 16 Apr 2010 15:41:04 +0800
Michael Alger <ds...@mm.quex.org> wrote:

[...]

> Okay, now I get why you see a big difference between the modes.
> Since I live in a perfect fantasy world where all classification
> errors are corrected, I wasn't seeing the significance. :)
> 
The perfect fantasy world is where all of us responsible for DSPAM (aka: the 
admins) live. But lazy users are, unfortunately, the reality.


> I was browsing through the README looking to see if dspam had any
> nice hooks for helping to build my own corpus,
>
# dspam_admin change preference ds...@mm.quex.org "makeCorpus" "on"

Or default for everyone:
# dspam_admin change preference default "makeCorpus" "on"

Or, if you don't use the preference extension, set it in dspam.conf or in 
individual user prefs files.
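
From memory, the entries look like this (a hedged sketch; the exact syntax is 
documented in the comments of the stock dspam.conf):
-----------------------------------
# dspam.conf (system-wide default):
Preference "makeCorpus=on"
-----------------------------------
and in an individual user's prefs file it is a plain key=value line:
-----------------------------------
makeCorpus=on
-----------------------------------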


> and came across this:
> 
>   tum: Train-until-Mature.  This training mode is a hybrid
>        between the other two training modes and provides a great
>        balance between volatility and static metadata.
> 
> So apparently I'm not the only one that sees TUM as something of a
> combination between TEFT and TOE. However, the explanation of TUM
> in the README doesn't mention TL as affecting whether it learns or
> not.
>
From the learning-method viewpoint it is not a hybrid. TEFT and TUM learn 
without you telling them to learn; TOE only learns when you tell it to learn. 
That is the reason why I said that TUM should not be compared to TOE. But from 
the way TUM works as a whole, it is indeed a hybrid.


> Is this out of date?
> 
No. It is still valid.


> The explanation (abridged from the README version):
> 
>   TuM will train on a per-token basis only tokens which have had
>   fewer than 50 "hits" on them, unless an error is being retrained
>   in which case all tokens are trained.
> 
>   NOTE: You should corpus train before using tum.
> 
> suggests to me that it actually learns a little differently than
> TEFT (and without regard to TL), in that tokens that already have 50
> hits on them will be ignored.
> 
The documentation is not 100% clear in this regard. Only default tokens (BNR 
tokens don't fall into that category; they are a different token type, not 
default tokens) having ((spam_hits + innocent_hits) < 50) are automatically 
trained by TUM. For classification, however, TUM uses every token.

The code that does the magic in regards to training is this here:
-----------------------------------
    if (ds_term->type == 'D' &&
        ( CTX->training_mode != DST_TUM  || 
          CTX->source == DSS_ERROR       ||
          CTX->source == DSS_INOCULATION ||
          ds_term->s.spam_hits + ds_term->s.innocent_hits < 50 ||
          ds_term->key == diction->whitelist_token             ||
          CTX->confidence < 0.70))
    {
        ds_term->s.status |= TST_DIRTY;
    }
-----------------------------------

Translated, that means:
if ([current token type] is [default token])
    and
     (
       ([training mode] is not [TUM])
      or
       ([current message source] is [ERROR])
      or
       ([current message source] is [INOCULATION])
      or
       ([current token spam hits] + [current token innocent hits] is less than [50])
      or
       ([current token key] is [WHITELIST])
      or
       ([current message confidence] is less than [0.70 (aka 70%)])
     )
  then
    mark [current token] as [DIRTY]
end if

Marking a token as dirty instructs DSPAM to write the updated token data back 
to the storage backend in use.
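
The dirty bit is a plain C bit flag: setting it with |= and later testing it 
with & is what lets the backend skip write-backs for unchanged tokens. Here is 
a generic illustration of the mechanics (my own minimal sketch, not DSPAM's 
actual storage code; the real TST_DIRTY value lives in the DSPAM headers):
-----------------------------------
#include <stdio.h>

#define TST_DIRTY 0x01   /* illustrative value; see the DSPAM headers for the real one */

int main(void)
{
    int status = 0;

    status |= TST_DIRTY;        /* training decided: mark the token as changed */

    if (status & TST_DIRTY)     /* storage backend: write back only what changed */
        printf("token written back to storage\n");

    return 0;
}
-----------------------------------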

Let's take an example:
* training mode is TUM
* the message source is not ERROR
* the message source is not INOCULATION
* for simplicity let us assume all default tokens of the message have 20 
innocent hits and 20 spam hits
* for simplicity let us assume the message has no whitelist token
* the whole message has a confidence of 0.80

Then the above condition would result in (for each individual token):
(true) and (false or false or false or true or false or false) -> (true) and 
(true) => true

So each of the tokens would be marked dirty (aka learn the token) because we 
get a TRUE for ((spam_hits + innocent_hits) < 50).

Now let's use the same values, but this time each token has 40 spam hits and 40 
innocent hits AND the whole message has a confidence of 0.65.

Then the above condition would result in (for each individual token):
(true) and (false or false or false or false or false or true) -> (true) and 
(true) => true

As you can see, the individual tokens would still be trained by TUM because the 
whole message has a confidence of less than 0.70. The training is performed 
regardless of the fact that each individual token has a (spam_hits + 
innocent_hits) above 50 (80 in our example).

To sum it up: TUM will train a message (or parts of a message) if one of the 
following conditions applies:
* the source is ERROR (aka --source=error)
* the source is INOCULATION (aka --source=inoculation)
* individual tokens have (spam_hits + innocent_hits) < 50
* individual token is a whitelist token
* the whole message has a confidence factor of < 0.70/70%
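
If you want to play with the numbers yourself, here is a small standalone C 
program that mimics the condition for a context running in TUM mode (the struct 
and the names are my own simplification, not the real ds_term/CTX structures):
-----------------------------------
#include <stdio.h>

/* simplified stand-ins for DSPAM's internals -- my naming, not the real structs */
enum source { SRC_NONE, SRC_ERROR, SRC_INOCULATION };

struct token {
    char type;           /* 'D' = default token; BNR tokens have another type */
    long spam_hits;
    long innocent_hits;
    int  is_whitelist;   /* 1 if this is the whitelist token */
};

/* mirrors the dirty-marking condition, with training_mode fixed to TUM
   (so the "training_mode != DST_TUM" branch is always false and dropped) */
static int tum_marks_dirty(const struct token *t, enum source src, double confidence)
{
    return t->type == 'D' &&
           ( src == SRC_ERROR                     ||
             src == SRC_INOCULATION               ||
             t->spam_hits + t->innocent_hits < 50 ||
             t->is_whitelist                      ||
             confidence < 0.70 );
}

int main(void)
{
    /* example 1: 20+20 hits, confidence 0.80 -> dirty, because hits < 50 */
    struct token t1 = { 'D', 20, 20, 0 };
    printf("example 1: %s\n", tum_marks_dirty(&t1, SRC_NONE, 0.80) ? "dirty" : "clean");

    /* example 2: 40+40 hits, confidence 0.65 -> dirty, because confidence < 0.70 */
    struct token t2 = { 'D', 40, 40, 0 };
    printf("example 2: %s\n", tum_marks_dirty(&t2, SRC_NONE, 0.65) ? "dirty" : "clean");

    return 0;
}
-----------------------------------
Compiled and run, both lines print "dirty", matching the two walkthroughs above.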


> Thanks again for all your explanations.
> 
No problem. I hope my examples are clear?

btw: I could now be cheeky and mention that you can read all of this inside the 
DSPAM source code :)


-- 
Kind Regards from Switzerland,

Stevan Bajić
