http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5686





------- Additional Comments From [EMAIL PROTECTED]  2007-10-24 03:04 -------
More meddling with tokenization.  r587841 is an experiment that discards the
OSBF-style tokenization and uses the simpler SpamAssassin "split on
whitespace" tokenization, still with the OSBF bigram format:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 ( 9.173%) ..........|...........
0.040 (21.726%) ..........|.........................
0.040 ( 0.011%)           |
0.080 (47.814%) ..........|.......................................................
0.080 ( 0.017%)           |
0.120 (15.204%) ..........|.................
0.120 ( 0.017%)           |
0.160 ( 3.527%) ..........|....
0.160 ( 0.006%)           |
0.200 ( 1.331%) ..........|..
0.200 ( 0.022%)           |
0.240 ( 0.653%) ..........|.
0.240 ( 0.143%) ##        |
0.280 ( 0.263%) ......    |
0.280 ( 0.397%) ######    |
0.320 ( 0.126%) ...       |
0.320 ( 0.171%) ###       |
0.360 ( 0.121%) ...       |
0.360 ( 0.243%) ####      |
0.400 ( 0.040%) .         |
0.400 ( 0.303%) #####     |
0.440 ( 0.020%)           |
0.440 ( 0.353%) ######    |
0.480 ( 0.496%) ########  |
0.520 ( 0.623%) ##########|
0.560 ( 0.579%) ######### |
0.600 ( 0.882%) ##########|#
0.640 ( 1.295%) ##########|#
0.680 ( 1.554%) ##########|#
0.720 (11.001%) ##########|#########
0.760 (69.604%) ##########|#######################################################
0.800 (11.436%) ##########|#########
0.840 ( 0.777%) ##########|#
0.880 ( 0.011%)           |
0.960 ( 0.061%) #         |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$160.00
Total ham:spam:   19764:18144
FP:     0 0.000%    FN:    39 0.215%
Unsure:  1210 3.192%     (ham:   113 0.572%    spam:  1097 6.046%)
TCRs:              l=1 15.972    l=5 15.972    l=9 15.972
SUMMARY: 0.30/0.70  fp     0 fn    39 uh   113 us  1097    c 160.00
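
For anyone checking the arithmetic, the summary figures above can be
reproduced with the short sketch below.  The weights are an assumption
inferred from the reported numbers (FP = 10, FN = 1, unsure = 0.1 each, and
unsure spam counted as missed in the TCR denominator); they are not copied
from the test script, and the function name is made up for illustration.

```python
def summarize(n_ham, n_spam, fp, fn, unsure_ham, unsure_spam, lam=1):
    """Recompute the summary figures from raw counts.

    Assumed weights (inferred from the reported output, not from the script):
    cost = 10*FP + 1*FN + 0.1*unsure, and
    TCR  = n_spam / (lam*FP + FN + unsure_spam).
    """
    unsure = unsure_ham + unsure_spam
    cost = 10.0 * fp + 1.0 * fn + 0.1 * unsure
    missed = lam * fp + fn + unsure_spam
    tcr = n_spam / missed if missed else float("inf")
    return {
        "cost": round(cost, 2),
        "fp_pct": round(100.0 * fp / n_ham, 3),
        "fn_pct": round(100.0 * fn / n_spam, 3),
        "unsure_pct": round(100.0 * unsure / (n_ham + n_spam), 3),
        "unsure_ham_pct": round(100.0 * unsure_ham / n_ham, 3),
        "unsure_spam_pct": round(100.0 * unsure_spam / n_spam, 3),
        "tcr": round(tcr, 3),
    }


if __name__ == "__main__":
    # Counts taken from the summary above.
    for lam in (1, 5, 9):
        print(lam, summarize(19764, 18144, 0, 39, 113, 1097, lam=lam))
```

With FP = 0 the lambda term drops out, which would explain why the three
TCR values in the output are identical.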


So I think that basically doesn't work too well.  There are a high number of
one-off spam FNs scattered across the 0.040-0.440 range, and a ham FP at
0.880, both of which the more complex OSBF tokenization style avoids.
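
To make the experiment concrete, here is a rough sketch of "split on
whitespace" tokens fed into sparse bigrams.  It is not the code from
r587841: the window size of 5, the "prev|distance|current" feature format
and the function names are all assumptions chosen for illustration, based on
the general orthogonal-sparse-bigram idea of pairing each token with each of
the few tokens before it.

```python
def whitespace_tokens(text):
    """Simplest possible tokenizer: split on runs of whitespace."""
    return text.split()


def sparse_bigrams(tokens, window=5):
    """Pair each token with each of the (window - 1) tokens preceding it.

    The distance between the two tokens is kept in the feature, so
    "a b" and "a <one token skipped> b" produce different features.
    The "prev|distance|current" format is hypothetical.
    """
    features = []
    for i, current in enumerate(tokens):
        for distance in range(1, window):
            j = i - distance
            if j < 0:
                break  # ran off the start of the message
            features.append(f"{tokens[j]}|{distance}|{current}")
    return features


if __name__ == "__main__":
    toks = whitespace_tokens("buy cheap  meds now")
    for feature in sparse_bigrams(toks):
        print(feature)
```

Splitting on whitespace leaves punctuation and markup attached to the words,
so the resulting bigrams are less likely to repeat across messages than
those built from a more careful tokenizer.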




