http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5686
------- Additional Comments From [EMAIL PROTECTED] 2007-10-30 11:24 -------
More tests. Setting N_SIGNIFICANT_TOKENS to be infinite (i.e. using all
tokens instead of the N most significant/strong ones) is bad:
SCORE NUMHIT DETAIL OVERALL HISTOGRAM (. = ham, # = spam)
0.000 ( 0.506%) ..........|.
0.040 (25.658%) ..........|.................................
0.080 (43.067%) ..........|.......................................................
0.120 (22.166%) ..........|............................
0.120 ( 0.055%) # |
0.160 ( 6.275%) ..........|........
0.200 ( 1.569%) ..........|..
0.200 ( 0.055%) # |
0.240 ( 0.607%) ..........|.
0.240 ( 0.717%) ##########|#
0.280 ( 0.051%) . |
0.280 ( 0.276%) #### |
0.320 ( 0.101%) ... |
0.320 ( 0.276%) #### |
0.360 ( 0.276%) #### |
0.400 ( 0.221%) ### |
0.440 ( 0.441%) ####### |
0.480 ( 0.662%) ##########|#
0.520 ( 1.323%) ##########|#
0.560 ( 0.882%) ##########|#
0.600 ( 0.827%) ##########|#
0.640 ( 0.882%) ##########|#
0.680 ( 1.047%) ##########|#
0.720 ( 8.379%) ##########|######
0.760 (70.948%) ##########|#######################################################
0.800 (12.679%) ##########|##########
0.880 ( 0.055%) # |
Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$27.60
Total ham:spam: 1976:1814
FP: 0 0.000% FN: 15 0.827%
Unsure: 126 3.325% (ham: 3 0.152% spam: 123 6.781%)
TCRs: l=1 13.145 l=5 13.145 l=9 13.145
SUMMARY: 0.30/0.70 fp 0 fn 15 uh 3 us 123 c 27.60
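The cost figure in the SUMMARY line appears to be a weighted sum of the error counts. A minimal sketch that reproduces it, assuming $1 per false negative and $0.10 per unsure message. The $10 false-positive weight is an assumption that these runs cannot confirm, since fp is 0 in every one of them. parse_summary and summary_cost are hypothetical helper names, not part of the test scripts.

```python
import re

# Weights inferred from the reported figures. FP_WEIGHT is an assumption:
# no run in this comment has a false positive, so it cannot be checked.
FP_WEIGHT = 10.0
FN_WEIGHT = 1.0
UNSURE_WEIGHT = 0.1


def parse_summary(line):
    """Extract the fp, fn, uh and us counts from a 'SUMMARY: ...' line."""
    fields = dict(re.findall(r"\b(fp|fn|uh|us)\s+(\d+)", line))
    return {name: int(value) for name, value in fields.items()}


def summary_cost(counts):
    """Weighted cost: false positives, false negatives, then all unsures."""
    unsure = counts["uh"] + counts["us"]
    return (FP_WEIGHT * counts["fp"]
            + FN_WEIGHT * counts["fn"]
            + UNSURE_WEIGHT * unsure)


counts = parse_summary("SUMMARY: 0.30/0.70 fp 0 fn 15 uh 3 us 123 c 27.60")
print(f"{summary_cost(counts):.2f}")  # 27.60
```

The same weights also give 18.80 for "fp 0 fn 1 uh 13 us 165".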
N_SIGNIFICANT_TOKENS=999 is still on the wrong side of the baseline:
SCORE NUMHIT DETAIL OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (24.747%) ..........|............................................
0.040 (18.522%) ..........|.................................
0.080 (31.123%) ..........|.......................................................
0.120 (13.057%) ..........|.......................
0.160 ( 5.820%) ..........|..........
0.160 ( 0.055%) # |
0.200 ( 4.251%) ..........|........
0.240 ( 1.822%) ..........|...
0.280 ( 0.405%) ..........|.
0.280 ( 0.110%) ### |
0.320 ( 0.152%) ..... |
0.320 ( 0.331%) ######## |
0.360 ( 0.101%) .... |
0.360 ( 0.110%) ### |
0.400 ( 0.772%) ##########|#
0.440 ( 0.165%) #### |
0.480 ( 0.717%) ##########|#
0.520 ( 0.606%) ##########|#
0.560 ( 0.992%) ##########|#
0.600 ( 1.268%) ##########|##
0.640 ( 1.985%) ##########|##
0.680 ( 7.166%) ##########|########
0.720 (24.862%) ##########|#############################
0.760 (46.472%) ##########|#######################################################
0.800 (13.671%) ##########|################
0.840 ( 0.662%) ##########|#
0.920 ( 0.055%) # |
Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$18.80
Total ham:spam: 1976:1814
FP: 0 0.000% FN: 1 0.055%
Unsure: 178 4.697% (ham: 13 0.658% spam: 165 9.096%)
TCRs: l=1 10.928 l=5 10.928 l=9 10.928
SUMMARY: 0.30/0.70 fp 0 fn 1 uh 13 us 165 c 18.80
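The TCR (total cost ratio) values are consistent with dividing the spam count by a weighted error count in which unsure spam is counted alongside false negatives. A sketch under that assumption: the formula is inferred from the figures in this comment, not taken from the test script, and tcr is a hypothetical name.

```python
def tcr(n_spam, fp, fn, unsure_spam, lam=1):
    """Total cost ratio: spam count over the lambda-weighted error count.

    Assumption: unsure spam counts as missed spam, and only false
    positives are scaled by lambda.
    """
    denom = lam * fp + fn + unsure_spam
    return float("inf") if denom == 0 else n_spam / denom


# Figures from the N_SIGNIFICANT_TOKENS=999 run: 1814 spam, fn 1, us 165.
for lam in (1, 5, 9):
    print(lam, round(tcr(1814, fp=0, fn=1, unsure_spam=165, lam=lam), 3))
```

With fp at 0 the lambda weight has nothing to multiply, which would explain why l=1, l=5 and l=9 report the same value in every run here.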
N_SIGNIFICANT_TOKENS=150, ditto:
SCORE NUMHIT DETAIL OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (24.747%) ..........|............................................
0.040 (18.522%) ..........|.................................
0.080 (31.123%) ..........|.......................................................
0.120 (13.057%) ..........|.......................
0.160 ( 5.820%) ..........|..........
0.160 ( 0.055%) # |
0.200 ( 4.251%) ..........|........
0.240 ( 1.822%) ..........|...
0.280 ( 0.405%) ..........|.
0.280 ( 0.110%) ### |
0.320 ( 0.152%) ..... |
0.320 ( 0.331%) ######## |
0.360 ( 0.101%) .... |
0.360 ( 0.110%) ### |
0.400 ( 0.772%) ##########|#
0.440 ( 0.165%) #### |
0.480 ( 0.717%) ##########|#
0.520 ( 0.606%) ##########|#
0.560 ( 0.992%) ##########|#
0.600 ( 1.268%) ##########|##
0.640 ( 1.985%) ##########|##
0.680 ( 7.166%) ##########|########
0.720 (24.862%) ##########|#############################
0.760 (46.472%) ##########|#######################################################
0.800 (13.671%) ##########|################
0.840 ( 0.662%) ##########|#
0.920 ( 0.055%) # |
Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$18.80
Total ham:spam: 1976:1814
FP: 0 0.000% FN: 1 0.055%
Unsure: 178 4.697% (ham: 13 0.658% spam: 165 9.096%)
TCRs: l=1 10.928 l=5 10.928 l=9 10.928
SUMMARY: 0.30/0.70 fp 0 fn 1 uh 13 us 165 c 18.80
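For reference, N_SIGNIFICANT_TOKENS caps how many tokens contribute to the score. A minimal sketch of that kind of selection, assuming "significant" means a per-token spam probability far from the neutral 0.5. The function name and the sample probabilities are hypothetical, and this is not the SpamAssassin code.

```python
def most_significant(token_probs, n):
    """Return the n (token, probability) pairs furthest from 0.5."""
    ranked = sorted(token_probs.items(),
                    key=lambda item: abs(item[1] - 0.5),
                    reverse=True)
    return ranked[:n]


# Made-up per-token spam probabilities, for illustration only.
probs = {"viagra": 0.99, "meeting": 0.03, "the": 0.51, "click": 0.80}

top = most_significant(probs, 2)
print([token for token, _ in top])
```

Passing an n larger than the number of tokens keeps every token, including near-neutral ones such as "the", which corresponds to the "infinite" setting tried first.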
Trying out a new tokenization, where the header and URIs are simply
"split on whitespace", but the body still uses the full OSBF tokenization,
is pretty bad compared to baseline:
0.000 ( 4.706%) ..........|..........
0.040 (11.285%) ..........|........................
0.080 (11.842%) ..........|.........................
0.120 (25.860%) ..........|.......................................................
0.160 (25.607%) ..........|......................................................
0.200 (11.437%) ..........|........................
0.200 ( 0.055%) # |
0.240 ( 6.174%) ..........|.............
0.280 ( 2.429%) ..........|.....
0.280 ( 0.165%) ### |
0.320 ( 0.506%) ..........|.
0.320 ( 0.276%) ##### |
0.360 ( 0.051%) .. |
0.360 ( 0.221%) #### |
0.400 ( 0.772%) ##########|#
0.440 ( 0.221%) #### |
0.480 ( 0.101%) .... |
0.480 ( 1.433%) ##########|#
0.520 ( 0.606%) ##########|#
0.560 ( 1.433%) ##########|#
0.600 ( 2.150%) ##########|##
0.640 (16.869%) ##########|################
0.680 (58.545%) ##########|#######################################################
0.720 (17.089%) ##########|################
0.760 ( 0.110%) ## |
0.840 ( 0.055%) # |
Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$101.10
Total ham:spam: 1976:1814
FP: 0 0.000% FN: 1 0.055%
Unsure: 1001 26.412% (ham: 61 3.087% spam: 940 51.819%)
TCRs: l=1 1.928 l=5 1.928 l=9 1.928
SUMMARY: 0.30/0.70 fp 0 fn 1 uh 61 us 940 c 101.10
split(' ') for just headers is also not an improvement:
SCORE NUMHIT DETAIL OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (11.184%) ..........|...............
0.040 (34.615%) ..........|..............................................
0.080 (41.346%) ..........|.......................................................
0.120 (10.273%) ..........|..............
0.120 ( 0.055%) # |
0.160 ( 1.569%) ..........|..
0.200 ( 0.709%) ..........|.
0.200 ( 0.055%) # |
0.240 ( 0.304%) ........ |
0.240 ( 0.827%) ##########|#
0.280 ( 0.165%) ### |
0.320 ( 0.221%) ### |
0.360 ( 0.386%) ###### |
0.400 ( 0.551%) ######### |
0.440 ( 0.551%) ######### |
0.480 ( 1.268%) ##########|#
0.520 ( 0.992%) ##########|#
0.560 ( 1.764%) ##########|#
0.600 ( 1.488%) ##########|#
0.640 ( 5.347%) ##########|####
0.680 (70.232%) ##########|#######################################################
0.720 (14.939%) ##########|############
0.760 ( 1.103%) ##########|#
0.840 ( 0.055%) # |
Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$102.30
Total ham:spam: 1976:1814
FP: 0 0.000% FN: 17 0.937%
Unsure: 853 22.507% (ham: 0 0.000% spam: 853 47.023%)
TCRs: l=1 2.085 l=5 2.085 l=9 2.085
SUMMARY: 0.30/0.70 fp 0 fn 17 uh 0 us 853 c 102.30
Tokenizing just URLs this way is even worse (see that spam creeping closer
to 0.0):
SCORE NUMHIT DETAIL OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (10.374%) ..........|.............
0.040 (34.109%) ..........|...........................................
0.080 (43.168%) ..........|.......................................................
0.080 ( 0.055%) # |
0.120 ( 9.818%) ..........|.............
0.160 ( 1.518%) ..........|..
0.200 ( 0.759%) ..........|.
0.200 ( 0.055%) # |
0.240 ( 0.253%) ...... |
0.240 ( 0.827%) ##########|#
0.280 ( 0.165%) ### |
0.320 ( 0.221%) ### |
0.360 ( 0.386%) ###### |
0.400 ( 0.551%) ######### |
0.440 ( 0.551%) ######### |
0.480 ( 1.323%) ##########|#
0.520 ( 0.937%) ##########|#
0.560 ( 1.985%) ##########|##
0.600 ( 1.213%) ##########|#
0.640 ( 5.788%) ##########|#####
0.680 (69.901%) ##########|#######################################################
0.720 (14.939%) ##########|############
0.760 ( 1.047%) ##########|#
0.840 ( 0.055%) # |
Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$102.80
Total ham:spam: 1976:1814
FP: 0 0.000% FN: 17 0.937%
Unsure: 858 22.639% (ham: 0 0.000% spam: 858 47.299%)
TCRs: l=1 2.073 l=5 2.073 l=9 2.073
SUMMARY: 0.30/0.70 fp 0 fn 17 uh 0 us 858 c 102.80
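The "split on whitespace" variant tried for headers and URIs amounts to something like the sketch below. Perl's split(' ') drops leading whitespace and splits on runs of whitespace, which Python's argument-less str.split() mirrors. The sample header is made up and whitespace_tokens is a hypothetical name.

```python
def whitespace_tokens(text):
    """Split on runs of whitespace, ignoring leading and trailing blanks."""
    return text.split()


# Made-up header line with irregular spacing and a tab.
header = "  Subject: cheap   meds\tonline "
print(whitespace_tokens(header))  # ['Subject:', 'cheap', 'meds', 'online']
```

Punctuation stays attached to its word (as in "Subject:"), so this produces coarser tokens than a tokenizer that also breaks on punctuation.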
Interesting! These were all tweaks I thought might help, but they
really don't; the graphs and figures don't lie. The baseline
tokenization just works better in all my testing.