On Thu, 15 Apr 2010 17:47:41 +0200 Stevan Bajić <ste...@bajic.ch> wrote:
> On Thu, 15 Apr 2010 17:35:43 +0800
> Michael Alger <ds...@mm.quex.org> wrote:
> > [...]
> > However, I don't understand why simply classifying a message using
> > TOE decrements the Training Left counter. My understanding is that
> > token statistics are only updated when retraining a misclassified
> > message; classifying a message shouldn't cause any changes here, and
> > thus logically shouldn't be construed as "training" the system.
> >
> > Is this done purely so the statistical sedation is deactivated in
> > TOE mode after 2,500 messages have been processed, or are there
> > other reasons?
> >
> You have run into the classical problem of statistical thinking. There is an
> example, found in a lot of psychological literature, that demonstrates the
> problem most humans have with statistical thinking. It is known in the
> socio-psychological literature as the "taxi/cab problem". Let me quickly
> show you the example:
> ------------------------------------------------
> Two taxi companies are active in a city. The taxis of company A are
> green, those of company B blue. Company A operates 15% of the taxis,
> company B the remaining 85%. One night there is a hit-and-run accident.
> The fleeing car was a taxi. A witness states that it was a green taxi.
>
> The court orders an examination of the witness's ability to differentiate
> between green and blue taxis under night viewing conditions. The test
> result is: in 80% of the cases the witness was able to identify the
> correct color, and he was wrong in the remaining 20% of the cases.
>
> How high is the probability that the fleeing taxi the witness saw that
> night was a (green) taxi from company A?
> ------------------------------------------------
>
> Most people would spontaneously answer 80%.
> In fact, a study has shown that a majority of the persons asked (among
> them physicians, judges and students of elite universities) answer the
> question with 80%.
>
> But the correct answer is not 80% :)
>
> Allow me to explain:
> The whole city has 1'000 taxis. 150 (green) belong to company A and 850
> (blue) belong to company B. One of those 1'000 taxis is responsible for
> the accident. The witness says he saw a green taxi, and we know that he
> is correct in 80% of the cases. That also means he calls a blue taxi
> green in 20% of the cases. From the 850 blue taxis he will thus call 170
> green (false positives). And from the 150 green taxis he will correctly
> identify 120 as green (true positives). To calculate the probability
> that he actually saw a green taxi when he identifies a taxi (under night
> viewing conditions) as green, you need to divide all correct "green"
> answers (TP) by all "green" answers (FP + TP). Therefore the probability
> is: 120 / (170 + 120) = 0.41
>
> The probability that a green taxi caused the accident, given that the
> witness believes he saw a green taxi, is therefore less than 50%. This
> probability depends crucially on the distribution of the green and blue
> taxis in the city. If there were equal numbers of green and blue taxis
> in the city, then the correct answer would indeed be 80%.
>
> Most humans, however, tend to ignore the initial distribution (also
> called the a priori, base or initial probability). Psychologists speak
> in this connection of "base rate neglect".
> Here is a more detailed description of "base rate neglect" from
> Wikipedia: http://en.wikipedia.org/wiki/Base_rate_fallacy
>
> And now back to your original statement:
> ------------------------------------------------
> However, I don't understand why simply classifying a message using
> TOE decrements the Training Left counter.
> My understanding is that
> token statistics are only updated when retraining a misclassified
> message; classifying a message shouldn't cause any changes here, and
> thus logically shouldn't be construed as "training" the system.
> ------------------------------------------------
>
> Without DSPAM keeping track of the TP/TN, the whole calculation from
> above would not be possible. DSPAM would not know that there are 1'000
> taxis. It would only know about 30 green taxis and 170 blue taxis. You
> might now ask yourself: why 30 green and why 170 blue? Easy (assuming
> green = bad/Spam and blue = good/Ham):
> * 1'000 taxis (processed messages) -> TP + TN
> * 170 taxis identified as green (Spam) but they were blue (Ham) -> FP
> * 30 taxis identified as blue (Ham) but they were green (Spam) -> FN
>
> Without knowing TP and TN, the whole Bayes theorem calculation would not
> be possible, so DSPAM must keep track of them. It is indeed not a
> learning step, but for the computation of the probability it is crucial
> to know those values.
>
> And since the statistical sedation implemented in DSPAM waters down the
> result in order to minimize FP, the Training Left (TL) value was
> introduced in DSPAM as a way to limit that watering-down phase. The more
> positive/negative classifications DSPAM has made, the more mature the
> tokens are considered to be. So after 2'500 TP/TN, the statistical
> sedation gets automatically disabled.
>
> I hope you now understand better why we need to update the statistics
> even if we are not really learning (with TOE)?
>
> Sorry for such a long mail. It is hard for me to explain some things (in
> English, which is not my native language) without going too deep into
> statistics/mathematics. I hope my text above is easy to understand and
> does not have too many grammatical errors?
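The taxi/cab arithmetic above is just Bayes' theorem with the base rate included. A minimal sketch to verify the numbers (Python chosen purely for illustration; the function name is mine and this is not DSPAM code, which is written in C):

```python
# Posterior probability that the taxi really was green, given that the
# witness says "green" (Bayes' theorem, base rate included).

def p_green_given_says_green(n_taxis, green_share, witness_accuracy):
    n_green = n_taxis * green_share
    n_blue = n_taxis - n_green
    tp = witness_accuracy * n_green        # green taxis correctly called green
    fp = (1 - witness_accuracy) * n_blue   # blue taxis wrongly called green
    return tp / (tp + fp)                  # TP / (TP + FP)

# The example from the mail: 1'000 taxis, 15% green, witness right 80% of the time.
print(round(p_green_given_says_green(1000, 0.15, 0.80), 2))  # -> 0.41

# With an equal split the base rate drops out and the answer is simply the
# witness accuracy, 0.8 -- the answer most people give spontaneously.
print(round(p_green_given_says_green(1000, 0.50, 0.80), 2))  # -> 0.8
```

This makes the "base rate neglect" point concrete: the same 80%-accurate witness yields a 41% or an 80% posterior depending only on the a priori distribution, which is exactly why DSPAM needs the TP/TN counts and not just the misclassifications.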
>
> --
> Kind Regards from Switzerland,
>
> Stevan Bajić

--
Kind Regards from Switzerland,

Stevan Bajić

_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user