On Fri, 16 Apr 2010 15:41:04 +0800 Michael Alger <ds...@mm.quex.org> wrote:
[...]

> Okay, now I get why you see a big difference between the modes.
> Since I live in a perfect fantasy world where all classification
> errors are corrected, I wasn't seeing the significance. :)
>
The perfect fantasy world is where we all (the ones responsible for
DSPAM, aka the admins) live. But lazy users are unfortunately the
reality.

> I was browsing through the README looking to see if dspam had any
> nice hooks for helping to build my own corpus,
>
# dspam_admin change preference ds...@mm.quex.org "makeCorpus" "on"

Or as the default for everyone:

# dspam_admin change preference default "makeCorpus" "on"

Or, if you don't use the preference extension, set it in dspam.conf
or in the individual users' prefs files.

> and came across this:
>
> tum: Train-until-Mature. This training mode is a hybrid
> between the other two training modes and provides a great
> balance between volatility and static metadata.
>
> So apparently I'm not the only one that sees TUM as something of a
> combination between TEFT and TOE. However, the explanation of TUM
> in the README doesn't mention TL as affecting whether it learns or
> not.
>
From the learning-method viewpoint it is not a hybrid: TEFT and TUM
learn without you telling them to learn, while TOE only learns when
you tell it to. That is the reason why I said that TUM should not be
compared to TOE. But in the way TUM works as a whole, it is indeed a
hybrid.

> Is this out of date?
>
No. It is still valid.

> The explanation (abridged from the README version):
>
> TuM will train on a per-token basis only tokens which have had
> fewer than 50 "hits" on them, unless an error is being retrained
> in which case all tokens are trained.
>
> NOTE: You should corpus train before using tum.
>
> suggests to me that it actually learns a little differently than
> TEFT (and without regard to TL), in that tokens that already have 50
> hits on them will be ignored.
>
The documentation is not 100% clear in this regard. Only default
tokens with ((spam_hits + innocent_hits) < 50) are automatically
trained by TUM (BNR tokens don't fall into that category; they are
another token type, not default tokens). For classification, however,
TUM uses every token.

The code that does the magic in regards to training is this here:
-----------------------------------
if (ds_term->type == 'D' && (
    CTX->training_mode != DST_TUM  ||
    CTX->source == DSS_ERROR       ||
    CTX->source == DSS_INOCULATION ||
    ds_term->s.spam_hits + ds_term->s.innocent_hits < 50 ||
    ds_term->key == diction->whitelist_token ||
    CTX->confidence < 0.70))
{
  ds_term->s.status |= TST_DIRTY;
}
-----------------------------------

Translated, that means:

if ([current token type] is [default token])
   and (
        ([training mode] is not [TUM])
     or ([current message source] is [ERROR])
     or ([current message source] is [INOCULATION])
     or ([current token spam hits] + [current token innocent hits]
         is less than [50])
     or ([current token key] is [WHITELIST])
     or ([current message confidence] is less than [0.70 (aka 70%)])
   )
then
   mark [current token] as [DIRTY]
end if

Marking a token as dirty instructs DSPAM to write the updated token
data back to the storage backend in use.
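If it helps to see that condition outside of DSPAM's internals, here
it is pulled out into a small standalone helper. The struct and the
function name are mine, not DSPAM's (the real code operates on the
ds_term/CTX structures shown above); only the logic of the if() is
taken from the source:
-----------------------------------
#include <stdbool.h>

/* Simplified stand-ins for the fields DSPAM reads from
 * ds_term / CTX. These names are illustrative only. */
struct token {
    char type;           /* 'D' = default token; BNR tokens have
                          * another type                           */
    long spam_hits;      /* ds_term->s.spam_hits                   */
    long innocent_hits;  /* ds_term->s.innocent_hits               */
    bool is_whitelist;   /* ds_term->key == diction->whitelist_token */
};

enum msg_source { SRC_NORMAL, SRC_ERROR, SRC_INOCULATION };

/* Returns true when the token would be marked TST_DIRTY,
 * i.e. when its hit counters get written back (= trained). */
static bool token_should_train(const struct token *t, bool mode_is_tum,
                               enum msg_source src, double confidence)
{
    return t->type == 'D' && (
        !mode_is_tum                         ||
        src == SRC_ERROR                     ||
        src == SRC_INOCULATION               ||
        t->spam_hits + t->innocent_hits < 50 ||
        t->is_whitelist                      ||
        confidence < 0.70);
}
-----------------------------------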
Let's take an example:

* the training mode is TUM
* the message source is not ERROR
* the message source is not INOCULATION
* for simplicity, assume all default tokens of the message have
  20 innocent hits and 20 spam hits
* for simplicity, assume the message has no whitelist token
* the whole message has a confidence of 0.80

Then the above condition would result in (for each individual token):

(true) and (false or false or false or true or false or false)
-> (true) and (true)
=> true

So each of the tokens would be marked dirty (aka the token is
learned), because ((spam_hits + innocent_hits) < 50) gives us a TRUE.

Now use the same values, but this time each token has 40 spam hits
and 40 innocent hits AND the whole message has a confidence of 0.65.
Then the above condition would result in (for each individual token):

(true) and (false or false or false or false or false or true)
-> (true) and (true)
=> true

As you see, the individual tokens would still be trained by TUM,
because the whole message has a confidence of less than 0.70. The
training is performed even though each of the individual tokens has
a (spam_hits + innocent_hits) above 50 (80 in our example).

To sum it up: TUM trains a message (respectively, parts of a message)
if one of the following conditions applies:

* the source is ERROR (aka --source=error)
* the source is INOCULATION (aka --source=inoculation)
* an individual token has (spam_hits + innocent_hits) < 50
* an individual token is a whitelist token
* the whole message has a confidence factor of < 0.70 (aka 70%)

(If you want to play with those conditions yourself, have a look at
the small test program after my signature.)

> Thanks again for all your explanations.
>
No problem. I hope my example is clear?

btw: I could now be cheeky and mention that you can read all of this
inside the DSPAM source code :)

-- 
Kind Regards from Switzerland,

Stevan Bajić
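PS: here is a small self-contained toy program that runs the two
examples from above (plus a third one where nothing matches) through
a simplified version of the condition. The function name and layout
are mine; with the training mode fixed to TUM and the token type
fixed to 'D', only the remaining or-conditions from the DSPAM source
are left:
-----------------------------------
#include <stdio.h>
#include <stdbool.h>

/* Training mode is assumed to be TUM and the token type 'D',
 * so only the or-conditions of the original if() remain. */
static bool should_train(long spam_hits, long innocent_hits,
                         bool src_error, bool src_inoculation,
                         bool is_whitelist, double confidence)
{
    return src_error                      ||
           src_inoculation                ||
           spam_hits + innocent_hits < 50 ||
           is_whitelist                   ||
           confidence < 0.70;
}

int main(void)
{
    /* Example 1: 20+20 hits, confidence 0.80 -> trained (hits < 50) */
    printf("example 1: %s\n",
           should_train(20, 20, false, false, false, 0.80)
               ? "trained" : "not trained");

    /* Example 2: 40+40 hits, confidence 0.65 -> trained (conf < 0.70) */
    printf("example 2: %s\n",
           should_train(40, 40, false, false, false, 0.65)
               ? "trained" : "not trained");

    /* Mature tokens and a confident message -> not trained */
    printf("example 3: %s\n",
           should_train(40, 40, false, false, false, 0.80)
               ? "trained" : "not trained");

    return 0;
}
-----------------------------------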