On Thu, 15 Apr 2010 17:35:43 +0800
Michael Alger <ds...@mm.quex.org> wrote:

[...]
> However, I don't understand why simply classifying a message using
> TOE decrements the Training Left counter. My understanding is that
> token statistics are only updated when retraining a misclassified
> message; classifying a message shouldn't cause any changes here, and
> thus logically shouldn't be construed as "training" the system.
> 
> Is this done purely so the statistical sedation is deactivated in
> TOE mode after 2,500 messages have been processed, or are there
> other reasons?
> 
You are running into the classic problem most humans have with statistical 
thinking. There is an example that you will find in a lot of the psychological 
literature which demonstrates this difficulty. The problem is known in the 
socio-psychological literature as the "taxi/cab problem". Let me quickly show 
you the example:
------------------------------------------------
Two taxi companies are active in a city. The taxis of company A are green, 
those of company B blue. Company A operates 15% of the taxis, company B the 
remaining 85%. One night there is an accident with a hit-and-run. The fleeing 
car was a taxi. A witness states that it was a green taxi.

The court orders a test of the witness's ability to differentiate between 
green and blue taxis under nighttime viewing conditions. The test result: in 
80% of the cases the witness was able to identify the correct color, and he 
was wrong in the remaining 20% of the cases.

How high is the probability that the fleeing taxi the witness saw that night 
really was a green taxi from company A?
------------------------------------------------

Most people spontaneously answer 80%. In fact, studies have shown that a 
majority of the people asked (among them physicians, judges and students of 
elite universities) answer the question with 80%.

But the correct answer is not 80% :)

Allow me to explain:
The whole city has 1'000 taxis. 150 (green) belong to company A and 850 (blue) 
belong to company B. One of those 1'000 taxis is responsible for the accident. 
The witness says he saw a green taxi, and we know that he is correct in 80% of 
the cases. That also means he calls a blue taxi green in 20% of the cases. 
From the 850 blue taxis he will thus wrongly call 170 green (false positives). 
And from the 150 green taxis he will correctly identify 120 as green (true 
positives). To calculate the probability that he actually saw a green taxi 
when he identifies a taxi (under night viewing conditions) as green, you need 
to divide all correct answers of "green" (TP) by all answers of "green" 
(FP + TP). Therefore the probability is: 120 / (170 + 120) = 0.41

The probability that a green taxi caused the accident, given that the witness 
believes he saw a green taxi, is therefore less than 50%. This probability 
depends crucially on the distribution of the green and blue taxis in the city. 
If there were equal numbers of green and blue taxis, the witness would produce 
400 true positives and 100 false positives, and the correct answer would 
indeed be 400 / (100 + 400) = 0.80.

Most humans, however, tend to ignore the initial distribution (also called the 
a priori, base or prior probability). Psychologists speak in this connection 
of "base rate neglect".

And now back to your original statement:
------------------------------------------------
However, I don't understand why simply classifying a message using
TOE decrements the Training Left counter. My understanding is that
token statistics are only updated when retraining a misclassified
message; classifying a message shouldn't cause any changes here, and
thus logically shouldn't be construed as "training" the system.
------------------------------------------------

Without DSPAM keeping track of the TP/TN, the whole calculation from above 
would not be possible. DSPAM would not know that there are 1'000 taxis. It 
would only know about 30 green taxis and 170 blue taxis. You might now ask 
yourself: why 30 green and why 170 blue? Easy (assuming green = bad/spam and 
blue = good/ham):
* 1'000 taxis (processed messages) -> TP + TN + FP + FN
* 170 taxis identified as green (spam) but they were blue (ham) -> FP
* 30 taxis identified as blue (ham) but they were green (spam) -> FN

Without knowing TP and TN, the whole Bayes' theorem calculation would not be 
possible. So DSPAM must keep track of them. It is indeed not a learning 
operation, but for the computation of the probability it is crucial to know 
those values.
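
To make that concrete, here is the taxi picture again as a small Python sketch 
(illustrative only, with my own variable names; it is not the actual DSPAM 
code):
------------------------------------------------
# What DSPAM would know from retraining alone versus with TP/TN counters.
fp = 170                          # classified as spam, retrained as ham
fn = 30                           # classified as ham, retrained as spam

# Retraining only reveals the mistakes: 200 of the 1'000 "taxis".
known_from_retraining = fp + fn   # 200

# Counting every classification additionally gives the correct decisions:
tp_plus_tn = 800                  # updated on each classification
processed  = tp_plus_tn + fp + fn # 1'000 -> the "whole city"

# Only with that denominator can any rate be computed at all, e.g.:
print((fp + fn) / processed)      # 0.2 -> the witness's 20% error rate
------------------------------------------------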

And since the statistical sedation implemented in DSPAM waters down the 
results in order to minimize FP, the whole Training Left (TL) value was 
introduced in DSPAM as a way to limit that watering-down phase. The more 
positive/negative classifications DSPAM has done, the more mature the tokens 
are considered to be. So after 2'500 TP/TN the statistical sedation is 
automatically disabled.
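
The exact sedation formula is beyond the scope of this mail, but the mechanism 
is roughly the following (an illustrative sketch with made-up numbers and 
names, not the real DSPAM implementation):
------------------------------------------------
# Illustration of "watering down": while the Training Left counter is
# still high, pull the raw spam probability toward the neutral 0.5 so
# immature token data cannot produce confident (and possibly FP) verdicts.
# NOTE: this is my own toy formula, not DSPAM's actual sedation code.
TRAINING_TOTAL = 2500   # TL starts here, decremented per classification

def sedated(p_spam, training_left, neutral=0.5):
    if training_left <= 0:
        return p_spam                        # mature: sedation disabled
    weight = training_left / TRAINING_TOTAL  # 1.0 fresh ... 0.0 mature
    return (1 - weight) * p_spam + weight * neutral

print(sedated(0.98, 2500))  # 0.5  -> fresh install, fully watered down
print(sedated(0.98, 1250))  # 0.74 -> halfway through training
print(sedated(0.98, 0))     # 0.98 -> after 2'500 TP/TN, raw result
------------------------------------------------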

I hope you now understand better why we need to update the statistics even 
when we are not really learning (with TOE).

Sorry for such a long mail. It is hard for me to explain some things (in 
English, which is not my native language) without going too deep into 
statistics/mathematics. I hope my text above is easy to understand and does 
not have too many grammatical errors.


-- 
Kind Regards from Switzerland,

Stevan Bajić
