On Thu, 15 Apr 2010 17:35:43 +0800 Michael Alger <ds...@mm.quex.org> wrote:
[...]
> However, I don't understand why simply classifying a message using
> TOE decrements the Training Left counter. My understanding is that
> token statistics are only updated when retraining a misclassified
> message; classifying a message shouldn't cause any changes here, and
> thus logically shouldn't be construed as "training" the system.
>
> Is this done purely so the statistical sedation is deactivated in
> TOE mode after 2,500 messages have been processed, or are there
> other reasons?
>
You are running into the classic problem most humans have with statistical thinking. There is an example, found throughout the psychological literature, that demonstrates this problem. It is known in the sociopsychological literature as the "taxi/cab problem". Let me quickly show you the example:

------------------------------------------------
Two taxi companies are active in a city. The taxis of company A are green, those of company B blue. Company A operates 15% of the taxis, company B the remaining 85%. One night there is an accident with a hit and run. The fleeing car was a taxi. A witness states that it was a green taxi. The court orders a test of the witness's ability to differentiate between green and blue taxis under night viewing conditions. The test result: in 80% of the cases the witness identified the correct color, and in the remaining 20% of the cases he was wrong. How high is the probability that the fleeing taxi the witness saw that night was a (green) taxi from company A?
------------------------------------------------

Most people spontaneously answer 80%. In fact, a study has shown that a majority of the people asked (among them physicians, judges and students of elite universities) answer the question with 80%. But the correct answer is not 80% :)

Allow me to explain: the whole city has 1'000 taxis. 150 (green) belong to company A and 850 (blue) belong to company B. One of those 1'000 taxis is responsible for the accident. The witness says he saw a green taxi, and we know that he is correct in 80% of the cases. That also means that in 20% of the cases he calls a blue taxi green. From the 850 blue taxis he will thus wrongly call 170 green (false positives). And from the 150 green taxis he will correctly identify 120 as green (true positives). To calculate the probability that he actually saw a green taxi when he identifies a taxi (under night viewing conditions) as green, you divide all correct "green" answers (TP) by all "green" answers (FP + TP). Therefore the probability is:

  120 / (170 + 120) = 0.41

The probability that a green taxi caused the accident, given that the witness believes he saw a green taxi, is therefore less than 50%. This probability depends crucially on the distribution of green and blue taxis in the city. Were there equal numbers of green and blue taxis in the city, the correct answer would indeed be 80%. Most humans, however, tend to ignore the initial distribution (a.k.a. the a priori, prior or base probability). Psychologists speak in this connection of "base rate neglect".
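Just in case it helps, here is the same calculation as a tiny Python script (plain Python, nothing DSPAM-specific; the numbers are simply the ones from the example above):

------------------------------------------------
# Taxi/cab problem: the base rate matters.
total_taxis      = 1000
green_taxis      = 150      # company A (15%)
blue_taxis       = 850      # company B (85%)
witness_accuracy = 0.80     # correct color call in 80% of cases

# How often does the witness say "green"?
true_positives  = green_taxis * witness_accuracy       # 120 green taxis correctly called green
false_positives = blue_taxis * (1 - witness_accuracy)  # 170 blue taxis wrongly called green

# P(taxi was green | witness says "green")
p_green = true_positives / (true_positives + false_positives)

print(round(p_green, 2))    # 0.41, not 0.80
------------------------------------------------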
And now back to your original statement:

------------------------------------------------
However, I don't understand why simply classifying a message using
TOE decrements the Training Left counter. My understanding is that
token statistics are only updated when retraining a misclassified
message; classifying a message shouldn't cause any changes here, and
thus logically shouldn't be construed as "training" the system.
------------------------------------------------

Without DSPAM keeping track of the TP/TN, the whole calculation from above would not be possible. DSPAM would not know that there are 1'000 taxis. It would only know about 30 green taxis and 170 blue taxis. You might now ask yourself: why 30 green and why 170 blue? Easy (assuming green = bad/spam and blue = good/ham):

* 1'000 taxis (all processed messages) -> TP + TN + FP + FN
* 170 taxis identified as green (spam) but which were blue (ham) -> FP
* 30 taxis identified as blue (ham) but which were green (spam) -> FN

Without knowing TP and TN, the whole Bayes theorem calculation would not be possible, so DSPAM must keep track of them. Updating those counters is indeed not a learning operation, but for the computation of the probability it is crucial to know their values. And since the statistical sedation implemented in DSPAM waters down the result in order to minimize FP, the Training Left (TL) value was introduced in DSPAM as a way to limit that watering-down phase. The more positive/negative classifications DSPAM has done, the more mature the tokens are considered to be. So after 2'500 TP/TN the statistical sedation gets automatically disabled (see the PS at the end of this mail for a small sketch of that bookkeeping).

I hope you now understand better why we need to update the statistics even if we are not really learning (with TOE). Sorry for such a long mail. It is hard for me to explain some things (in English, which is not my native language) without going too deep into statistics/mathematics. I hope the text above is easy to understand and does not have too many grammatical errors.

-- 
Kind Regards from Switzerland,

Stevan Bajić
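PS: Here is a rough sketch of the bookkeeping described above, in plain Python rather than DSPAM's actual C code. All the names (ToeClassifier, training_left, ...) and the watering-down formula are my own inventions for illustration only, not DSPAM's real internals:

------------------------------------------------
SEDATION_THRESHOLD = 2500   # TL starts here; sedation is active while TL > 0

class ToeClassifier:
    def __init__(self):
        self.tp = self.tn = self.fp = self.fn = 0
        self.training_left = SEDATION_THRESHOLD

    def classify(self, spam_probability):
        """A TOE classification: no token learning happens here,
        but the TP/TN statistics and the TL counter are updated."""
        if self.training_left > 0:
            # Statistical sedation: water the raw result down towards
            # "ham" to minimize FP while the tokens are still immature.
            # (Illustrative formula only, not the one DSPAM uses.)
            weight = self.training_left / SEDATION_THRESHOLD
            spam_probability *= (1 - 0.5 * weight)
            self.training_left -= 1   # the decrement you observed

        is_spam = spam_probability >= 0.5
        if is_spam:
            self.tp += 1   # counted as TP until the user corrects us
        else:
            self.tn += 1   # counted as TN until the user corrects us
        return is_spam

    def retrain(self, was_spam):
        """User correction of a misclassification: move the count
        from TP/TN to FN/FP. Token retraining itself is omitted."""
        if was_spam:
            self.tn -= 1
            self.fn += 1
        else:
            self.tp -= 1
            self.fp += 1
------------------------------------------------

After 2'500 classifications training_left reaches 0, the sedation branch is skipped, and the counter stops moving, which matches what you see with the Training Left value in TOE mode.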