On Thu, 15 Apr 2010 17:35:43 +0800
Michael Alger <ds...@mm.quex.org> wrote:

[...]

> Thank you for this explanation and after a quick test I see that the
> TL counter does decrement (and TN increments) when I process mail
> using TOE. If I set it to NOTRAIN, then none of the statistics are
> updated when the message is processed.
> 
Right.


> However, I don't understand why simply classifying a message using
> TOE decrements the Training Left counter. My understanding is that
> token statistics are only updated when retraining a misclassified
> message; classifying a message shouldn't cause any changes here, and
> thus logically shouldn't be construed as "training" the system.
> 
You are right and wrong. When classifying a message:

1) if using TEFT (regardless of TL) or TUM (only while TL > 0), the table 
dspam_token_data gets updated and/or new entries are added.

2) if using TOE or NOTRAIN, the table dspam_token_data does NOT get any new 
entries.

3) if using TOE, existing entries in table dspam_token_data (aka tokens) will 
get their "last_hit" updated, but neither "spam_hits" nor "innocent_hits" will 
be updated.

4) if using TOE or TEFT or TUM, the table dspam_stats will be updated, but 
only the fields "spam_classified" and/or "innocent_classified".
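The four cases above can be sketched roughly as follows. The table and field 
names are the ones from the DSPAM schema mentioned above, but the dicts and 
the function itself are only an illustration of the rules, not DSPAM's actual 
code:

```python
# Rough sketch of the side effects of a pure CLASSIFICATION per training
# mode. token_data stands in for dspam_token_data, stats for dspam_stats.
# This illustrates the four rules above; it is not DSPAM's real code.

def classify_update(mode, tl, tokens, token_data, stats, is_spam):
    now = "2010-04-15 17:35:43"  # placeholder for the real timestamp

    if mode == "NOTRAIN":
        return  # rule 2: nothing at all is updated

    if mode == "TEFT" or (mode == "TUM" and tl > 0):
        # rule 1: dspam_token_data is updated and/or new entries added
        for t in tokens:
            row = token_data.setdefault(
                t, {"spam_hits": 0, "innocent_hits": 0, "last_hit": None})
            row["spam_hits" if is_spam else "innocent_hits"] += 1
            row["last_hit"] = now
    elif mode == "TOE":
        # rule 3: only "last_hit" of EXISTING tokens is refreshed;
        # spam_hits/innocent_hits stay untouched, and (rule 2) no new rows
        for t in tokens:
            if t in token_data:
                token_data[t]["last_hit"] = now

    # rule 4: TOE/TEFT/TUM bump the classified counters in dspam_stats
    stats["spam_classified" if is_spam else "innocent_classified"] += 1
```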


Learning is another issue. When learning happens, the stats get updated 
(fields: "spam_learned", "innocent_learned", "spam_misclassified", 
"innocent_misclassified", "spam_corpusfed", "innocent_corpusfed").


Learning really only happens if you tell DSPAM that a message needs to be 
reclassified or corpus-fed, or when using TEFT (regardless of TL) or TUM (only 
while TL > 0).

But in order to use Bayesian classification, DSPAM also needs to know how many 
messages it has seen in total. So it is logical that it keeps track of that by 
updating the table dspam_stats and incrementing "spam_classified" and/or 
"innocent_classified".
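To see why the totals matter, here is a textbook Graham-style token 
probability (a hedged sketch, not DSPAM's actual formula): per-token hit 
counts only become probabilities once they are divided by the total number of 
spam and innocent messages seen.

```python
# Textbook (Graham-style) token probability. Illustrates why the overall
# message totals are needed; this is not DSPAM's actual implementation.

def token_spam_probability(spam_hits, innocent_hits,
                           total_spam, total_innocent):
    # Hit counts mean nothing by themselves: 10 spam hits out of 100
    # spam messages is very different from 10 out of 100000.
    ps = spam_hits / max(total_spam, 1)          # ~ P(token | spam)
    pi = innocent_hits / max(total_innocent, 1)  # ~ P(token | innocent)
    return ps / (ps + pi) if (ps + pi) > 0 else 0.5
```

With equal totals a 50/50 token scores 0.5, but the same hit counts against 
very unequal totals shift the probability sharply.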


> Is this done purely so the statistical sedation is deactivated in
> TOE mode after 2,500 messages have been processed, or are there
> other reasons?
> 
Yes. It's only for the statistical sedation.


> Does TUM base its decision to learn purely on the value of the TL
> counter (i.e. stops learning once that reaches 0), or is the TL just
> a hint and TUM actually uses some heuristic based on the number of
> tokens available to it and their scores?
> 
No. TUM is 100% like TEFT until it reaches TL = 0. So TUM and TEFT are FORCING 
A LEARNING on each message they see. TOE really only learns if you tell it to 
learn (no implicit learning, only explicit learning).

To sum it up:
* TEFT (regardless of TL) and TUM (only while TL > 0) LEARN EVERY message 
they see.

* TOE only learns if you TELL IT TO LEARN.

* TEFT (regardless of TL) and TUM (only while TL > 0) can even LEARN WRONG and 
depend on you to fix their errors. If you use TEFT, or TUM until TL reaches 0, 
and you DON'T correct errors, then the quality of your tokens can decrease (it 
can increase as well, but only if no classified message was a FP or a FN).
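Those bullet points reduce to two tiny predicates (a sketch only; TL is the 
Training Left counter, and the TUM condition follows the "100% like TEFT until 
TL = 0" rule stated above):

```python
# Sketch of the learning rules summarised above; not DSPAM's actual code.

def learns_automatically(mode, tl):
    """Implicit learning: does this mode learn every message it classifies?"""
    if mode == "TEFT":
        return True       # always, regardless of TL
    if mode == "TUM":
        return tl > 0     # 100% like TEFT until TL reaches 0
    return False          # TOE and NOTRAIN never learn implicitly

def learns_on_correction(mode):
    """Explicit learning: reclassifying or corpus-feeding a message."""
    return mode in ("TEFT", "TUM", "TOE")   # NOTRAIN updates nothing
```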



> Is TL used by anything other than the statistical sedation feature?
> 
No.


> I think saying "TOE is totally different from {NOTRAIN, TEFT, TUM}"
> is a little strong. It seems to me that TEFT and TOE are quite
> different, while TUM is a combination of the two: TEFT until it has
> enough data, and then TOE. Or have I misunderstood?
> 
Yes, you have misunderstood. TUM and TEFT can learn something wrong, while TOE 
only learns something when you tell it to learn. TUM and TEFT learn by 
themselves: they FIRST learn and then depend on you to FIX errors. TOE does 
not do that. TOE only learns when you want it to learn.

Allow me to illustrate something.

Assume you have 1000 tokens in DSPAM. And assume you have a corpus A with 100 
messages and corpus B with 100 messages.

Test case 1)
Now assume you use TEFT/TUM and you check all those mails from corpus A. And 
assume you get 100% accuracy.

Test case 2)
Now assume you use TOE and you check all those mails from corpus A. And assume 
you likewise get 100% accuracy.


So far, so good. Now assume you only CLASSIFY corpus B with test case 1 and 
with test case 2, and assume we don't care about the result we get from just 
classifying the mails from corpus B.

Now go back and repeat the classification the same way as done above with 
corpus A.

With test case 2 you will again get 100%. For sure!

With test case 1 you have a high chance of NOT getting 100% again. The reason 
is that TEFT and TUM would have changed "spam_learned" and "innocent_learned" 
while they only CLASSIFIED corpus B. They learned even though you told them 
only to classify corpus B.

Do you understand what I mean?
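The corpus A/B experiment can be mimicked in a few lines. The token tables and 
the single corpus B mail are hypothetical data; "learning" here simply bumps 
hit counters the way dspam_token_data hits are bumped:

```python
import copy

# Tiny model of the experiment above: both test cases start from the same
# trained state (corpus A learned), then "only classify" a corpus B mail.
# Hypothetical data; illustrates the drift, not DSPAM's real behaviour.

def learn(table, tokens, is_spam):
    for t in tokens:
        row = table.setdefault(t, [0, 0])  # [spam_hits, innocent_hits]
        row[0 if is_spam else 1] += 1

def classify(table, mode, tokens, guessed_spam):
    # TEFT (and TUM while TL > 0) implicitly learns whatever it just
    # classified -- including its own mistakes. TOE leaves counts alone.
    if mode == "TEFT":
        learn(table, tokens, guessed_spam)

toe = {"viagra": [5, 0], "meeting": [0, 5]}   # state after corpus A
teft = copy.deepcopy(toe)

# Classify one corpus B mail; suppose it is (mis)judged as spam.
msg = ["meeting", "viagra"]
classify(toe, "TOE", msg, guessed_spam=True)
classify(teft, "TEFT", msg, guessed_spam=True)

print(toe["meeting"])   # [0, 5] -- unchanged
print(teft["meeting"])  # [1, 5] -- drifted: a possibly wrong lesson learned
```

Repeating corpus A against the TOE table necessarily reproduces the original 
verdicts; against the TEFT table, the shifted counters can flip a borderline 
message.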



-- 
Kind Regards from Switzerland,

Stevan Bajić

_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user
