http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5686
------- Additional Comments From [EMAIL PROTECTED] 2007-10-25 12:11 -------

I've been doing some tokenizer tweaks, but none of them are really doing great, so one thing that would be handy at this point is to restate the current "baseline" best results so far, in r585992. The full 10-fold cross-validation's histogram is the last graph in comment 6. I'll paste it here:

SCORE NUMHIT    DETAIL     OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (25.415%) ..........|.......................................................
0.040 ( 9.831%) ..........|.....................
0.080 (22.571%) ..........|.................................................
0.120 (21.716%) ..........|...............................................
0.160 ( 8.435%) ..........|..................
0.200 ( 5.444%) ..........|............
0.200 ( 0.028%) #         |
0.240 ( 3.916%) ..........|........
0.240 ( 0.022%) #         |
0.280 ( 1.801%) ..........|....
0.280 ( 0.022%) #         |
0.320 ( 0.491%) ..........|.
0.320 ( 0.226%) #####     |
0.360 ( 0.116%) .....     |
0.360 ( 0.231%) ######    |
0.400 ( 0.040%) ..        |
0.400 ( 0.193%) #####     |
0.440 ( 0.132%) ###       |
0.480 ( 0.223%) ..........|
0.480 ( 1.334%) ##########|##
0.520 ( 0.110%) ###       |
0.560 ( 0.419%) ##########|#
0.600 ( 0.832%) ##########|#
0.640 ( 1.769%) ##########|##
0.680 ( 8.813%) ##########|###########
0.720 (36.767%) ##########|############################################
0.760 (45.712%) ##########|#######################################################
0.800 ( 3.279%) ##########|####
0.840 ( 0.006%)           |
0.880 ( 0.011%)           |
0.920 ( 0.022%) #         |
0.960 ( 0.072%) ##        |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$206.30
Total ham:spam: 19764:18144
FP: 0 0.000%  FN: 9 0.050%
Unsure: 1973 5.205% (ham: 528 2.672%  spam: 1445 7.964%)
TCRs: l=1 12.479  l=5 12.479  l=9 12.479
SUMMARY: 0.30/0.70 fp 0 fn 9 uh 528 us 1445 c 206.30

Conveniently, I've noticed that fold 1 is pretty representative of that graph and those numbers:

SCORE NUMHIT    DETAIL     OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (27.277%) ..........|.......................................................
0.040 (10.020%) ..........|....................
0.080 (21.356%) ..........|...........................................
0.120 (24.190%) ..........|.................................................
0.160 ( 8.654%) ..........|.................
0.200 ( 5.061%) ..........|..........
0.200 ( 0.055%) #         |
0.240 ( 2.379%) ..........|.....
0.280 ( 0.709%) ..........|.
0.280 ( 0.055%) #         |
0.320 ( 0.152%) ......    |
0.320 ( 0.386%) ##########|#
0.360 ( 0.051%) ..        |
0.360 ( 0.165%) ####      |
0.400 ( 0.110%) ###       |
0.440 ( 0.662%) ##########|#
0.480 ( 0.152%) ......    |
0.480 ( 0.937%) ##########|#
0.520 ( 0.276%) #######   |
0.560 ( 0.827%) ##########|#
0.600 ( 1.213%) ##########|##
0.640 ( 1.985%) ##########|###
0.680 (11.025%) ##########|###############
0.720 (39.802%) ##########|######################################################
0.760 (40.463%) ##########|#######################################################
0.800 ( 1.985%) ##########|###
0.960 ( 0.055%) #         |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$20.50
Total ham:spam: 1976:1814
FP: 0 0.000%  FN: 1 0.055%
Unsure: 195 5.145% (ham: 21 1.063%  spam: 174 9.592%)
TCRs: l=1 10.366  l=5 10.366  l=9 10.366
SUMMARY: 0.30/0.70 fp 0 fn 1 uh 21 us 174 c 20.50

This is handy because a single fold takes 1/10th of the time to run. ;)

(btw, note that you have to scale the "threshold optimization" cost figure 10x to cope with the corpus size difference; I should have normalized it but didn't.)

Anyway, I've checked it in as r588315. This is the new baseline for further tests.
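
For anyone comparing runs, the cost and TCR figures above can be reproduced from the raw counts. The following is only a sketch, not the actual threshold-optimization script: the `Run` class and its field names are made up for illustration, and the weights are inferred from the numbers in this comment ($1 per FN and $0.10 per unsure message give the quoted costs; TCR matches spam / (l*FP + FN + unsure spam)). The FP weight cannot be inferred because both runs have zero false positives, so the $10 used here is purely an assumption.

```python
from dataclasses import dataclass

# Per-message costs. FN_COST and UNSURE_COST are inferred from the cost
# figures quoted in this comment; FP_COST is an assumption, since both
# runs have zero false positives and the figures cannot confirm it.
FP_COST = 10.0
FN_COST = 1.0
UNSURE_COST = 0.10


@dataclass
class Run:
    """Hypothetical container for the counts on the SUMMARY line."""
    ham: int
    spam: int
    fp: int
    fn: int
    unsure_ham: int
    unsure_spam: int

    def cost(self) -> float:
        # Weighted sum of false positives, false negatives and unsures.
        unsure = self.unsure_ham + self.unsure_spam
        return self.fp * FP_COST + self.fn * FN_COST + unsure * UNSURE_COST

    def tcr(self, lam: int = 1) -> float:
        # Inferred form: total spam over weighted errors, with unsure spam
        # counted as missed. With fp == 0 the result is the same for any lam.
        return self.spam / (lam * self.fp + self.fn + self.unsure_spam)


full = Run(ham=19764, spam=18144, fp=0, fn=9, unsure_ham=528, unsure_spam=1445)
fold1 = Run(ham=1976, spam=1814, fp=0, fn=1, unsure_ham=21, unsure_spam=174)

for name, run in (("full", full), ("fold 1", fold1)):
    print(f"{name}: cost=${run.cost():.2f} TCR(l=1)={run.tcr():.3f}")

# Scale the single-fold cost 10x before comparing it with the full run.
print(f"fold 1 cost scaled 10x: ${10 * fold1.cost():.2f}")
```

With these inferred weights the sketch gives $206.30 and 12.479 for the full run, and $20.50 and 10.366 for fold 1; the scaled fold 1 cost is $205.00, close to the full-run figure.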
