http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5686
------- Additional Comments From [EMAIL PROTECTED] 2007-10-22 07:10 -------

(In reply to comment #7)
> 2. the effect of less training data, which is the real issue -- can OSBF do a
> better job with tiny amounts of training, than our existing Bayes impl?

Results from the weekend's testing of this. I ran the 10-fold cross-validation
driver with "--learnprob 0.1 --randseed 23" -- i.e. train on only 10% of the
messages -- and got these histograms:

SVN trunk:

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$252.30
Total ham:spam:   19764:18144
FP:     0  0.000%    FN:   155  0.854%
Unsure:   973  2.567%    (ham:    24  0.121%    spam:   949  5.230%)
TCRs:   l=1 16.435    l=5 16.435    l=9 16.435
SUMMARY: 0.30/0.70  fp 0 fn 155 uh 24 us 949  c 252.30

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (99.676%) ..........|.......................................................
0.000 ( 0.645%) ########  |
0.040 ( 0.040%)           |
0.040 ( 0.055%) #         |
0.080 ( 0.040%)           |
0.080 ( 0.022%)           |
0.120 ( 0.030%)           |
0.120 ( 0.050%) #         |
0.160 ( 0.035%)           |
0.160 ( 0.022%)           |
0.200 ( 0.040%)           |
0.200 ( 0.028%)           |
0.240 ( 0.015%)           |
0.240 ( 0.033%)           |
0.280 ( 0.020%)           |
0.280 ( 0.077%) #         |
0.320 ( 0.015%)           |
0.320 ( 0.061%) #         |
0.360 ( 0.015%)           |
0.360 ( 0.044%) #         |
0.400 ( 0.015%)           |
0.400 ( 0.121%) #         |
0.440 ( 0.035%)           |
0.440 ( 0.198%) ##        |
0.480 ( 0.020%)           |
0.480 ( 3.919%) ##########|##
0.520 ( 0.314%) ####      |
0.560 ( 0.165%) ##        |
0.600 ( 0.149%) ##        |
0.640 ( 0.077%) #         |
0.680 ( 0.215%) ###       |
0.720 ( 0.116%) #         |
0.760 ( 0.116%) #         |
0.800 ( 0.171%) ##        |
0.840 ( 0.121%) #         |
0.880 ( 0.193%) ##        |
0.920 ( 0.336%) ####      |
0.960 (92.752%) ##########|#######################################################

OSBF with EDDC:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 ( 4.007%) ..........|........
0.040 ( 3.177%) ..........|......
0.080 (18.787%) ..........|....................................
0.120 (28.415%) ..........|.......................................................
0.160 (17.588%) ..........|..................................
0.160 ( 0.006%)           |
0.200 (11.369%) ..........|......................
0.200 ( 0.011%)           |
0.240 ( 7.357%) ..........|..............
0.240 ( 0.022%) #         |
0.280 ( 4.574%) ..........|.........
0.280 ( 0.033%) #         |
0.320 ( 3.046%) ..........|......
0.320 ( 0.127%) ####      |
0.360 ( 1.184%) ..........|..
0.360 ( 0.303%) ######### |
0.400 ( 0.233%) ......... |
0.400 ( 0.733%) ##########|#
0.440 ( 0.046%) ..        |
0.440 ( 0.424%) ##########|#
0.480 ( 0.207%) ........  |
0.480 ( 1.560%) ##########|##
0.520 ( 0.010%)           |
0.520 ( 1.036%) ##########|##
0.560 ( 1.565%) ##########|##
0.600 ( 1.984%) ##########|###
0.640 ( 5.958%) ##########|#########
0.680 (20.993%) ##########|###############################
0.720 (36.795%) ##########|#######################################################
0.760 (25.143%) ##########|######################################
0.800 ( 3.213%) ##########|#####
0.840 ( 0.083%) ##        |
0.960 ( 0.011%)           |

The thresholds report looks like this:

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$583.00
Total ham:spam:   19764:18144
FP:     0  0.000%    FN:     7  0.039%
Unsure:  5760 15.195%    (ham:  1838  9.300%    spam:  3922 21.616%)
TCRs:   l=1 4.618    l=5 4.618    l=9 4.618
SUMMARY: 0.30/0.70  fp 0 fn 7 uh 1838 us 3922  c 583.00

but that's unfair, because 0.70 (as you can see from the histogram) is right
in the middle of most of the spam, so a large share of it lands in "unsure".
0.56 would be better:

Threshold optimization for hamcutoff=0.38, spamcutoff=0.56: cost=$234.80
Total ham:spam:   19764:18144
FP:     0  0.000%    FN:    55  0.303%
Unsure:   899  2.372%    (ham:   182  0.921%    spam:   717  3.952%)
TCRs:   l=1 23.503    l=5 23.503    l=9 23.503

I guess it's good, but it's not stellar :(
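
For anyone wanting to check the TCR lines by hand: the figures in all three reports are consistent with TCR = nspam / (l*FP + FN + unsure spam), i.e. unsure spam is counted as missed. That formula is inferred from the printed numbers, not read from the thresholds script, and since FP is 0 in every run the l weighting cannot be verified from this data. A minimal Python sketch under those assumptions (the function and dictionary names are made up for illustration):

```python
# Sketch: recompute the TCR values from the counts printed in the reports.
# Assumption (inferred from the numbers, not from the script's source):
#   TCR = nspam / (l * FP + FN + unsure_spam)

def tcr(nspam, fp, fn, unsure_spam, lam=1):
    """Total cost ratio, treating unsure spam as missed spam."""
    denom = lam * fp + fn + unsure_spam
    if denom == 0:
        return float("inf")  # nothing missed, nothing misfiled
    return nspam / denom

NSPAM = 18144  # total spam in the corpus, from "Total ham:spam"

# Counts copied from the three threshold reports.
reports = {
    "trunk 0.30/0.70": dict(fp=0, fn=155, unsure_spam=949),
    "OSBF  0.30/0.70": dict(fp=0, fn=7, unsure_spam=3922),
    "OSBF  0.38/0.56": dict(fp=0, fn=55, unsure_spam=717),
}

for name, counts in reports.items():
    print(f"{name}: TCR(l=1) = {tcr(NSPAM, **counts):.3f}")
```

This prints 16.435, 4.618 and 23.503, matching the reported TCRs. It also shows why the 0.30/0.70 cutoffs look so bad for OSBF: the 3922 unsure spam dominate the denominator even though only 7 messages are outright false negatives.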
