On Sat, 10 Apr 2010 17:59:25 +0800
Michael Alger <ds...@mm.quex.org> wrote:

> On Fri, Apr 09, 2010 at 11:23:16PM -0700, Terry Barnum wrote:
> >>> I've been running DSPAM for approximately 2 weeks and looking
> >>> at the output of dspam_stats, I'm curious how long training
> >>> normally takes.
> >>>
> >>> $ cat /usr/local/dspam.conf | grep -v ^# | grep -v ^$
> >>>
> >>> TrainingMode toe
> >>> Preference "trainingMode=TOE"
> 
> Your default settings are TOE mode. Are you overriding this for any
> of the users in their preferences? If not, this would explain why
> it's only learning from errors: because you told it to.
> 
> Try switching this to TUM or TEFT.
> 
I think most users here don't understand what training means in the context of 
Anti-Spam, so I am going to try to explain quickly what all those different 
training modes are. I will try to avoid the technical/mathematical/statistical 
mumbo-jumbo and use something else. Sorry if I make too many grammatical errors. 
I have a hard working day behind me and I am just going to type without 
taking much care about proper English.

The example I will use here is way oversimplified but good enough to explain 
the topic.

Okay. DSPAM has the following training modes:
* NOTRAIN
  => Do not do training

* TEFT
  => Train Everything (some say: Train Every F***ing Time)

* TUM
  => Train Until Mature

* TOE
  => Train On Error

* UNLEARN
  => Unlearn the (previous) training


Now my example:
Let us assume we have a young person who wants to become a specialist in a 
specific knowledge area/domain. At the beginning that young person does not 
know anything about the area.

Let us assume that the area has a lot of material that can be learned. That 
learning material is immense. Infinite. You never stop learning. But let us 
assume that in general a person is considered a specialist in that area/domain 
after he/she has passed 2'500 tests.

Now let us assume that each piece of training material is a book of +/- 100 
pages, and that you can take a test for each topic.

Now let us assume we have 4 young boys trying to become specialists. They are 
called (I know, I know. Stupid names, but anyway...):
* NOTRAIN
* TEFT
* TUM
* TOE

NOTRAIN never trains. He just relies on what he has learned in the past. He 
takes every test without learning before it and without learning after it. 
Regardless of the result, he simply moves on to the next test.

TEFT, on the other hand, takes the test like NOTRAIN, but each time after he 
has taken a test he buys a book (+/- 100 pages) about the tested topic and 
reads/learns it. He does this for each and every test. He does not stop after 
he has successfully passed 2'500 topic tests. He takes test 2'500 and 2'501 and 
2'502 and so on. He never ever stops learning (FORCED LEARNING).

TUM starts out exactly like TEFT. He takes the test and afterwards he, too, 
buys a book (+/- 100 pages) about the tested topic and reads/learns it. But as 
soon as he has successfully passed 2'500 tests he changes his strategy and 
stops buying books after tests he has passed. From then on he only buys and 
reads/learns a book if he has failed a test.

TOE is totally different from the other 3. He takes a test, and if he fails it 
he goes and buys a book (+/- 100 pages) about the tested topic and reads/learns 
it. He does that forever, for every test he takes. If he passes the test he 
does not buy the book and does not read those +/- 100 pages. He has passed the 
test and he knows it, so there is no need for him to invest time in reading 100 
pages for nothing. He is already knowledgeable in the topic he was tested on 
(remember: he passed the test).
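
The four students can be written down as one small decision rule. This is only 
a sketch of the analogy above, not DSPAM's actual code: the names Mode, 
reads_book and MATURITY_THRESHOLD are made up for illustration, and the 
threshold is simply the 2'500 tests from the example.

```python
from enum import Enum

# Hypothetical constant: the "2'500 passed tests" from the analogy.
MATURITY_THRESHOLD = 2500


class Mode(Enum):
    NOTRAIN = "notrain"
    TEFT = "teft"
    TUM = "tum"
    TOE = "toe"


def reads_book(mode: Mode, passed: bool, tests_passed_so_far: int) -> bool:
    """Return True if the student buys and reads a book after this test.

    tests_passed_so_far counts all tests passed up to and including
    the current one.
    """
    if mode is Mode.NOTRAIN:
        # Never learns, whatever the result.
        return False
    if mode is Mode.TEFT:
        # Always learns, whatever the result.
        return True
    if mode is Mode.TUM:
        # Behaves like TEFT until mature, then like TOE.
        if tests_passed_so_far < MATURITY_THRESHOLD:
            return True
        return not passed
    if mode is Mode.TOE:
        # Learns only after a failed test.
        return not passed
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    # A passed test late in the students' careers (3000 tests passed so far).
    for mode in Mode:
        print(mode.name, reads_book(mode, passed=True, tests_passed_so_far=3000))
```

The only inputs are whether the test was passed and how many tests have been 
passed so far, which is all the analogy uses to tell the four students apart.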


So now allow me to glue DSPAM together with the above example. In the DSPAM 
world those 2'500 tests would be TL (Training Left). And in the DSPAM world 
each of the trainees from above (except NOTRAIN and obviously UNLEARN) would 
take extra care while they have not yet passed at least 2'500 tests. The extra 
care is that DSPAM has an option called "statisticalSedation". This is a 
parameter that allows DSPAM to water down the catch rate (catch of Spam). It 
exists for those out there who are absolutely paranoid about FPs (false 
positives). I could now go on and explain the mathematical/statistical 
reasoning behind that parameter, but I will save myself some time by not 
explaining it. For now just accept that the parameter is there and that it 
allows you to tune how aggressively DSPAM tries to catch Spam while it has not 
yet processed at least 2'500 innocent messages.
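
As a rough illustration of the TL idea, assuming TL is simply the 2'500 target 
minus the innocent messages processed so far: the helper names below are 
hypothetical and not part of DSPAM, and how DSPAM itself computes TL or applies 
statisticalSedation internally may differ.

```python
# Hypothetical constant: the 2'500 innocent messages mentioned in the text.
TRAINING_TARGET = 2500


def training_left(innocent_processed: int) -> int:
    """Innocent messages still missing before the 2'500 target is reached."""
    return max(0, TRAINING_TARGET - innocent_processed)


def sedation_applies(innocent_processed: int) -> bool:
    """True while the filter is still in its careful, watered-down phase."""
    return training_left(innocent_processed) > 0


if __name__ == "__main__":
    for count in (0, 1000, 2499, 2500, 3000):
        print(count, training_left(count), sedation_applies(count))
```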

Okay. I think that by now most of you should more or less understand what those 
training modes are and how they work in DSPAM. A lot of you might now think 
that some of those modes are useless and others are more useful. But each of 
them has a reason to be there.

Take for example the NOTRAIN mode. While it looks totally useless, it can be 
well suited for ISPs (for example) that have already trained a lot of Ham/Spam 
messages and then just turn off DSPAM learning and only tag messages. Most 
people using NOTRAIN run DSPAM on a gateway where they have pre-learned a lot 
of messages in advance and just wish to offer Anti-Spam tagging without being 
interested in feedback from the final recipient. Those people mostly use other 
ways to train DSPAM than expecting the final recipient to send back his/her 
feedback. Such other ways of training could be that they train from time to 
time with a new corpus of Ham/Spam messages, or that they use inoculation from 
external sources, etc. Another use for NOTRAIN could be to 
block/filter/whatever outbound mail. There you mostly don't want the sender to 
be able to say what is Spam and what is Ham. You want to control that yourself. 
So NOTRAIN is the perfect mode for those people. And so on and so on... there 
are many reasons for NOTRAIN.

If you went to school then you probably know classmates who can be 
categorized as TEFT, TUM or TOE.

TEFT is the one who invests an insane amount of time in learning. He invests 
learning time in every area that he has a test for, regardless of whether he 
has passed another test in a similar area and regardless of whether he has 
passed the new test. He just learns AFTER he has taken the test, without caring 
about his result (of course with the aim to pass in the first place, or to pass 
any future test).

TUM can best be described as a schoolmate who is like TEFT until middle 
school and after that only learns if he has failed a test.

TOE can best be described as the lazy/smart schoolmate. He only learns if he 
does not pass a test. If he passes a test he is not going to invest time in 
learning the stuff from the passed test. One could say he is lazy (does not 
want to learn if he is passing) and/or one could say that he is smart (only 
learns if a test has shown him that he does not have enough knowledge in the 
tested area; smart as well because he is very economical with his learning 
time).


I personally prefer TOE because I recognize myself in TOE. I, too, do not keep 
learning when I know that I know the topic. I mean: I once took driving 
lessons, then passed the driving test, and after that had my driving license. 
Now when I take the car I just sit in it and drive. That's it. I am not going 
to take a driving lesson and a driving test every day after I have used the 
car. I would only take additional driving lessons and a test if the law forced 
me to (maybe because I was drunk, the police caught me and took away my driving 
license, and in order to get it back I need to take driving lessons again and 
pass a driving test) or if the stuff I learned has changed so significantly 
that I need to refresh and relearn things in order to be allowed to drive 
again. But I am definitely not going to force a learning lesson and a driving 
test on myself if I don't need to. Why should I? I see no reason to force 
myself into (for me) useless learning/testing actions.


If anyone here is fluent in English, has time, and is willing to correct my 
writing, then please do so. It would be great if a fixed and grammar-corrected 
version of the text flowed into the DSPAM Wiki so that other community members 
could profit from it.


-- 
Kind Regards from Switzerland,

Stevan Bajić

_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user
