http://bugzilla.spamassassin.org/show_bug.cgi?id=4331
------- Additional Comments From [EMAIL PROTECTED]  2005-05-19 19:28 -------
OK, here's some results!

KEY
---

- base: current svn trunk

Firstly, some code tweaks:

- no_inviz_tokens: ADD_INVIZ_TOKENS_I_PREFIX set to 0, so no invisible
  text tokens at all

- no_decomposed: inhibiting the decomposition of body tokens, and the
  mapping of Message-Id/In-Reply-To, From/To/Cc, and User-Agent/X-Mailer
  headers -- the tweaks discussed in bug 2129.

- casei: IGNORE_TITLE_CASE set to 0. in other words, fully
  case-insensitive for body text

- no8bits: TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES set to 0. in other words,
  8-bit sequences are not decomposed into byte-pairs.

- no_mid: IGNORE_MSGID_TOKENS set to 1. in other words, no message-ID
  tokens.

And some constant tweaks:

- s005: FW_S_CONSTANT = 0.050 instead of default 0.100
- s015: FW_S_CONSTANT = 0.150 instead of default 0.100
- x05: FW_X_CONSTANT = 0.500 instead of default 0.538
- mps02: MIN_PROB_STRENGTH = 0.2 instead of default 0.346
- mps04: MIN_PROB_STRENGTH = 0.4 instead of default 0.346

DB SIZES
--------

: jm 183...; l */results/config/dbs/bayes_toks
-rw------- 1 jm jm 1302528 May 19 14:08 x05/results/config/dbs/bayes_toks
-rw------- 1 jm jm 1302528 May 19 11:34 s015/results/config/dbs/bayes_toks
-rw------- 1 jm jm 1302528 May 19 09:00 s005/results/config/dbs/bayes_toks
-rw------- 1 jm jm 1302528 May 19 06:10 mps04/results/config/dbs/bayes_toks
-rw------- 1 jm jm 1302528 May 19 03:21 mps02/results/config/dbs/bayes_toks
-rw------- 1 jm jm 1298432 May 19 00:30 no_mid/results/config/dbs/bayes_toks
-rw------- 1 jm jm 1306624 May 18 21:04 no8bits/results/config/dbs/bayes_toks
-rw------- 1 jm jm 1306624 May 18 17:18 casei/results/config/dbs/bayes_toks
-rw------- 1 jm jm 1318912 May 18 14:15 no_decomposed/results/config/dbs/bayes_toks
-rw------- 1 jm jm 1302528 May 18 12:14 no_inviz_tokens/results/config/dbs/bayes_toks
-rw------- 1 jm jm 1302528 May 18 03:40 base/results/config/dbs/bayes_toks

interesting to see that 'no_decomposed' results
in a larger database! I have *no* idea why that is -- I guess the
decomposed tokens wind up more interesting normally, and the non-decomp
ones are expired out quicker when there are decomp tokens around.

GRAPHS
------

Next, some graphs. These are graphs of the P(spam) curves; ideally you
want to see a big spike at the left, made up entirely of ham, a big
spike on the right, made up entirely of spam, and both curving down to
0.5, where there's a smaller spike of the "unsures" that we don't want
to give a score to at all. Ideally there'd be no ham > 0.5, definitely
none at 0.99, and ditto vice-versa for spam.

They are all visible at http://taint.org/xfer/2005/bug-4331/ . I'd have
made a page on the Wiki, but that doesn't allow attachments. That's
helpful!

Also, under each graph below are the cost figures for Bayes based on
thresholds of 0.20 and 0.80:

    "fp" = ham in [0.8 .. 1.0] range,
    "fn" = spam in [0.0 .. 0.2] range,
    "uh" = unsure ham in [0.2 .. 0.8] range,
    "us" = unsure spam in [0.2 .. 0.8] range,
    "c"  = the resulting cost figure.

- g_base_v_no_inviz_tokens.png: as you can see, there's absolutely no
  difference in the graphs. Hmm. Looks like our use of invisible tokens
  in Bayes isn't working and can be disabled ;)
    base:            fp 24 fn 5 uh 815 us 2647 c 591.20
    no_inviz_tokens: fp 24 fn 5 uh 815 us 2648 c 591.30

- g_base_v_no_decomposed.png: there's little difference, generally --
  except that the FPs (ham in the 0.5 .. 1.0 range) and the FNs (spam
  in 0.0 .. 0.5) are higher. Clearly not a good idea to turn off
  decomposition then!
    no_decomposed: fp 27 fn 4 uh 781 us 3097 c 661.80

- g_casei.png: this is very, very close by the graph, but on
  examination you can see that several hams have been pushed into the
  solid-spam [0.8 .. 1.0] range. The cost figures below confirm this.
  Better stick with base.
    casei: fp 31 fn 6 uh 801 us 2673 c 663.40

- g_no8bits.png: virtually no difference, except for some more
  unsureness around the middle. In my opinion it's again better to
  stick with the base.
    no8bits: fp 24 fn 5 uh 810 us 2733 c 599.30

- g_no_mid.png: still looks like base is better. We don't gain very
  much with the Message-ID tokens, but OTOH the database size increase
  (about 0.3% according to the sizes above) is pretty tiny, too, so
  let's just leave it in.
    no_mid: fp 24 fn 4 uh 816 us 2741 c 599.70

- g_s_constants.png:
    s005: FW_S_CONSTANT = 0.050 instead of default 0.100
          fp 17 fn 4 uh 1046 us 3516 c 630.20
    s015: FW_S_CONSTANT = 0.150 instead of default 0.100
          fp 37 fn 7 uh 705 us 2188 c 666.30

  These are interesting! To remind you -- the S constant is the
  strength of learned data; if S is nearer to 0, then learned data is
  trusted more strongly. The fact that s005 has a very low FP/FN rate
  compared to the normal results is very attractive. It does increase
  the "unsure" rate, but in our implementation that's not a big deal --
  it just means that the message gets a 0 score from BAYES_50. I think
  exploring the low figures for S might be worthwhile.

- x05: FW_X_CONSTANT = 0.500 instead of default 0.538
    fp 22 fn 7 uh 753 us 2774 c 579.70

  Nothing really too exciting about this one. As expected, FPs go down
  but FNs go up. I think we might as well stick with the normal
  setting.

- g_mps.png:
    mps02: MIN_PROB_STRENGTH = 0.2 instead of default 0.346
           fp 33 fn 5 uh 727 us 1913 c 599.00
    mps04: MIN_PROB_STRENGTH = 0.4 instead of default 0.346
           fp 23 fn 4 uh 836 us 2829 c 600.50

  Nothing really too exciting here either. We could possibly go up to
  require 0.4 for a minimum probability strength, since it seems to
  have the nice effect of lowering FP *and* FN at the expense of a few
  more BAYES_50s on the uncertain cases. But I think tweaking S would
  be a better way to do that.

Overall: the code tweaks we have are still working well. This is good,
as I was worried that spam had changed enough to make them
counterproductive. One exception is the invisible-tokens stuff, which
is having no effect at all, and that is probably a bug.
;) I'm going to try a few more values for the S constant, which seems
to reduce FPs and FNs while increasing the BAYES_50 cases. In my
opinion it'd be more valuable for us at this stage to reduce FPs and
FNs, since we're not reliant on Bayes alone.
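
For anyone wanting to re-check the numbers: the "c" figures quoted in
this comment are all consistent with c = 10*fp + fn + 0.1*(uh + us).
That weighting is inferred from the figures themselves, not taken from
the test scripts, so treat it as an assumption. The Python sketch below
(function names are hypothetical) recomputes it, and also shows the
usual Robinson f(w) smoothing that the S and X constants are normally
understood to feed into, to illustrate what "S nearer to 0 trusts
learned data more" means. SpamAssassin's own code may differ in detail.

```python
def cost(fp, fn, uh, us):
    """Cost figure as inferred from the numbers in this comment.

    Assumption: an FP weighs 10, an FN weighs 1, and each unsure
    message (ham or spam) weighs 0.1.
    """
    return round(10 * fp + fn + 0.1 * (uh + us), 2)


def f_w(p_w, n, s=0.100, x=0.538):
    """Robinson-style smoothed token probability (illustrative sketch).

    p_w: raw spam probability observed for the token
    n:   number of messages the token has been seen in
    s:   strength given to the prior (FW_S_CONSTANT in the runs above)
    x:   assumed probability for a never-seen token (FW_X_CONSTANT)
    """
    return (s * x + n * p_w) / (s + n)


if __name__ == "__main__":
    # Recompute the cost for three of the runs listed above.
    print("base", cost(24, 5, 815, 2647))
    print("s005", cost(17, 4, 1046, 3516))
    print("x05 ", cost(22, 7, 753, 2774))

    # A token seen once, only in spam: smaller s leaves it closer to 1.0.
    for s in (0.050, 0.100, 0.150):
        print("s =", s, "f(w) =", round(f_w(1.0, 1, s=s), 3))
```

For a token seen once and only in spam, this gives f(w) of about 0.978
with s = 0.050 and about 0.940 with s = 0.150, so the smaller S lets a
single observation pull the estimate further from x.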
