On Thu, 15 Apr 2010 17:47:41 +0200 Stevan Bajić <ste...@bajic.ch> wrote:
> On Thu, 15 Apr 2010 17:35:43 +0800
> Michael Alger <ds...@mm.quex.org> wrote:
> > [...]
> > However, I don't understand why simply classifying a message using
> > TOE decrements the Training Left counter. My understanding is that
> > token statistics are only updated when retraining a misclassified
> > message; classifying a message shouldn't cause any changes here, and
> > thus logically shouldn't be construed as "training" the system.
> >
> > Is this done purely so the statistical sedation is deactivated in
> > TOE mode after 2,500 messages have been processed, or are there
> > other reasons?
> >
> You have run into the classical problem of statistical thinking. There is an
> example, found in a lot of psychological literature, that demonstrates the
> problem most humans have with statistical thinking. It is known in the
> socio-psychological literature as the "taxi/cab problem". Let me quickly
> show you the example:
> ------------------------------------------------
> Two taxi companies are active in a city. The taxis of company A are
> green, those of company B blue. Company A operates 15% of the taxis,
> company B the remaining 85%. One night there is a hit-and-run accident.
> The fleeing car was a taxi. A witness states that it was a green taxi.
>
> The court orders an examination of the witness's ability to differentiate
> between green and blue taxis under night viewing conditions. The test
> result is: in 80% of the cases the witness was able to identify the
> correct color, and he was wrong in the remaining 20% of the cases.
>
> How high is the probability that the fleeing taxi the witness saw that
> night was a (green) taxi from company A?
> ------------------------------------------------
>
> Most people would spontaneously answer 80%.
> In fact, a study has shown that a majority of the persons asked (among
> them physicians, judges and students of elite universities) answer the
> question with 80%.
>
> But the correct answer is not 80% :)
>
> Allow me to explain:
> The whole city has 1'000 taxis. 150 (green) belong to company A and 850
> (blue) belong to company B. One of those 1'000 taxis is responsible for
> the accident. The witness says he saw a green taxi, and we know that he
> is correct in 80% of the cases. That also means he calls a blue taxi
> green in 20% of the cases. From the 850 blue taxis he will thus call 170
> green (false positives). And from the 150 green taxis he will correctly
> identify 120 as green (true positives). To calculate the probability
> that he actually saw a green taxi when he identifies a taxi (under night
> viewing conditions) as green, you need to divide all correct "green"
> answers (TP) by all "green" answers (FP + TP). Therefore the probability
> is: 120 / (170 + 120) = 0.41
>
> The probability that a green taxi caused the accident, given that the
> witness believes he saw a green taxi, is therefore less than 50%. This
> probability depends crucially on the distribution of the green and blue
> taxis in the city. If there were equal numbers of green and blue taxis
> in the city, then the correct answer would indeed be 80%.
>
> Most humans, however, tend to ignore the initial distribution (also
> called the a priori, base or initial probability). Psychologists speak
> in this connection of "base rate neglect".
> Here is a more detailed description of "base rate neglect" from
> Wikipedia: http://en.wikipedia.org/wiki/Base_rate_fallacy
>
> And now back to your original statement:
> ------------------------------------------------
> However, I don't understand why simply classifying a message using
> TOE decrements the Training Left counter.
> My understanding is that
> token statistics are only updated when retraining a misclassified
> message; classifying a message shouldn't cause any changes here, and
> thus logically shouldn't be construed as "training" the system.
> ------------------------------------------------
>
> Without DSPAM keeping track of the TP/TN, the whole calculation from
> above would not be possible. DSPAM would not know that there are 1'000
> taxis. It would only know about 30 green taxis and 170 blue taxis. You
> might now ask yourself: why 30 green and why 170 blue? Easy (assuming
> green = bad/Spam and blue = good/Ham):
> * 1'000 taxis (processed messages) -> TP + TN
> * 170 taxis identified as green (Spam) but they were blue (Ham) -> FP
> * 30 taxis identified as blue (Ham) but they were green (Spam) -> FN
>
> Without knowing TP and TN, the whole Bayes theorem calculation would not
> be possible, so DSPAM must keep track of them. It is indeed not a
> learning step, but for the computation of the probability it is crucial
> to know those values.
>
> And since the statistical sedation implemented in DSPAM waters down the
> result in order to minimize FP, the Training Left (TL) value was
> introduced in DSPAM as a way to limit that watering-down phase. The more
> positive/negative classifications DSPAM has made, the more mature the
> tokens are considered to be. So after 2'500 TP/TN, the statistical
> sedation gets automatically disabled.
>
> I hope you now understand better why we need to update the statistics
> even if we are not really learning (with TOE)?
>
> Sorry for such a long mail. It is hard for me to explain some things (in
> English, which is not my native language) without going too deep into
> statistics/mathematics. I hope my text above is easy to understand and
> does not have too many grammatical errors?
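The taxi/cab arithmetic above is just Bayes' theorem with the base rate included. A minimal sketch to verify the numbers (Python chosen purely for illustration; the function name is mine and this is not DSPAM code, which is written in C):

```python
# Posterior probability that the taxi really was green, given that the
# witness says "green" (Bayes' theorem, base rate included).

def p_green_given_says_green(n_taxis, green_share, witness_accuracy):
    n_green = n_taxis * green_share
    n_blue = n_taxis - n_green
    tp = witness_accuracy * n_green        # green taxis correctly called green
    fp = (1 - witness_accuracy) * n_blue   # blue taxis wrongly called green
    return tp / (tp + fp)                  # TP / (TP + FP)

# The example from the mail: 1'000 taxis, 15% green, witness right 80% of the time.
print(round(p_green_given_says_green(1000, 0.15, 0.80), 2))  # -> 0.41

# With an equal split the base rate drops out and the answer is simply the
# witness accuracy, 0.8 -- the answer most people give spontaneously.
print(round(p_green_given_says_green(1000, 0.50, 0.80), 2))  # -> 0.8
```

This makes the "base rate neglect" point concrete: the same 80%-accurate witness yields a 41% or an 80% posterior depending only on the a priori distribution, which is exactly why DSPAM needs the TP/TN counts and not just the misclassifications.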
>
> --
> Kind Regards from Switzerland,
>
> Stevan Bajić

--
Kind Regards from Switzerland,

Stevan Bajić

_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user