http://bugzilla.spamassassin.org/show_bug.cgi?id=4331





------- Additional Comments From [EMAIL PROTECTED]  2005-05-19 19:28 -------
OK, here are some results!


KEY
---

- base: current svn trunk


Firstly, some code tweaks:

- no_inviz_tokens: ADD_INVIZ_TOKENS_I_PREFIX set to 0, so no invisible text
  tokens at all

- no_decomposed: inhibiting the decomposition of body tokens, and the mapping
  of Message-Id/In-Reply-To, From/To/Cc, and User-Agent/X-Mailer headers -- the
  tweaks discussed in bug 2129.

- casei: IGNORE_TITLE_CASE set to 0.  in other words, fully case-insensitive
  for body text

- no8bits: TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES set to 0.  in other words,
  8-bit sequences are not decomposed into byte-pairs.

- no_mid: IGNORE_MSGID_TOKENS set to 1.  in other words, no message-ID
  tokens.
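As a rough illustration of what the 'no8bits' tweak switches off, here's a hypothetical sketch (in Python, not the actual Perl tokenizer) of decomposing a long 8-bit run into byte pairs; the function name and the choice of overlapping pairs are my own assumptions, not SpamAssassin's real behaviour:

```python
def eightbit_to_byte_pairs(seq: bytes) -> list:
    """Decompose a run of 8-bit bytes into overlapping 2-byte tuples.

    NOTE: hypothetical illustration only -- the real tokenizer code
    (controlled by TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES) may differ in
    pair length, overlap, and token formatting.
    """
    return [seq[i:i + 2] for i in range(len(seq) - 1)]
```

With the tweak set to 0, such runs would simply not be decomposed this way.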


And some constant tweaks:

- s005: FW_S_CONSTANT = 0.050 instead of default 0.100

- s015: FW_S_CONSTANT = 0.150 instead of default 0.100

- x05: FW_X_CONSTANT = 0.500 instead of default 0.538

- mps02: MIN_PROB_STRENGTH = 0.2 instead of default 0.346

- mps04: MIN_PROB_STRENGTH = 0.4 instead of default 0.346
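For context on what S and X do: they are the constants in Gary Robinson's degree-of-belief formula, which blends a token's raw probability with a prior.  A minimal sketch (variable names are mine):

```python
def degree_of_belief(n, p_w, s=0.100, x=0.538):
    """Robinson's f(w) = (s*x + n*p(w)) / (s + n).

    n   -- number of messages the token has been seen in
    p_w -- raw per-token spam probability
    s   -- FW_S_CONSTANT: weight given to the prior x
    x   -- FW_X_CONSTANT: assumed prior probability for unseen tokens
    """
    return (s * x + n * p_w) / (s + n)
```

With n = 0 this returns x exactly; as n grows it converges on p(w), and a smaller s makes that convergence faster.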


DB SIZES
--------

: jm 183...; l */results/config/dbs/bayes_toks
-rw-------  1 jm jm 1302528 May 19 14:08 x05/results/config/dbs/bayes_toks
-rw-------  1 jm jm 1302528 May 19 11:34 s015/results/config/dbs/bayes_toks
-rw-------  1 jm jm 1302528 May 19 09:00 s005/results/config/dbs/bayes_toks
-rw-------  1 jm jm 1302528 May 19 06:10 mps04/results/config/dbs/bayes_toks
-rw-------  1 jm jm 1302528 May 19 03:21 mps02/results/config/dbs/bayes_toks
-rw-------  1 jm jm 1298432 May 19 00:30 no_mid/results/config/dbs/bayes_toks
-rw-------  1 jm jm 1306624 May 18 21:04 no8bits/results/config/dbs/bayes_toks
-rw-------  1 jm jm 1306624 May 18 17:18 casei/results/config/dbs/bayes_toks
-rw-------  1 jm jm 1318912 May 18 14:15 no_decomposed/results/config/dbs/bayes_toks
-rw-------  1 jm jm 1302528 May 18 12:14 no_inviz_tokens/results/config/dbs/bayes_toks
-rw-------  1 jm jm 1302528 May 18 03:40 base/results/config/dbs/bayes_toks

interesting to see that 'no_decomposed' results in a larger database!
I have *no* idea why that is -- my guess is that the decomposed tokens
normally wind up being the more interesting ones, so the non-decomposed
tokens are expired more quickly when decomposed tokens are around.


GRAPHS
------

Next, some graphs.  These are graphs of the P(spam) curves; ideally you want to
see a big spike at the left, made up entirely of ham, a big spike on the right,
made up entirely of spam, and both curving down to 0.5, where there's a smaller
spike of the "unsures" that we don't want to give a score to at all.  Ideally
there'd be no ham > 0.5, definitely none at 0.99, and ditto vice-versa for
spam.

They are all visible at http://taint.org/xfer/2005/bug-4331/ .  I'd have made
a page on the Wiki, but that doesn't allow attachments.  that's helpful!

Also, under each graph entry below is a line of cost figures for Bayes, based
on thresholds of 0.20 and 0.80: "fp" = ham in the [0.8 .. 1.0] range, "fn" =
spam in [0.0 .. 0.2], "uh" = unsure ham in [0.2 .. 0.8], and "us" = unsure
spam in [0.2 .. 0.8].


- g_base_v_no_inviz_tokens.png: as you can see, there's absolutely no
  difference in the graphs. hmm. looks like our use of invisible tokens in
  Bayes isn't working and can be disabled ;)

base:             fp    24 fn     5 uh   815 us  2647    c 591.20
no_inviz_tokens:  fp    24 fn     5 uh   815 us  2648    c 591.30

- g_base_v_no_decomposed.png: there's little difference, generally -- except
  that the FPs (ham in the 0.5 .. 1.0 range) and the FNs (spam in 0.0 .. 0.5)
  are both higher.  clearly not a good idea to turn off decomposition, then!

no_decomposed:    fp    27 fn     4 uh   781 us  3097    c 661.80

- g_casei.png: this is very, very close going by the graph alone, but on
  examination you can see that several hams have been pushed into the
  solid-spam [0.8, 1.0] range.  The cost figures below confirm this.  Better
  to stick with base.

casei:            fp    31 fn     6 uh   801 us  2673    c 663.40

- g_no8bits.png: virtually no difference, except for some more unsureness
  around the middle.  in my opinion it's again better to stick with the base.

no8bits:          fp    24 fn     5 uh   810 us  2733    c 599.30

- g_no_mid.png: still looks like base is better.  we don't gain very
  much with the Message-ID tokens, but OTOH the database size increase
  (about 0.3% by the sizes above) is pretty tiny, too, so let's just
  leave it in.

no_mid:           fp    24 fn     4 uh   816 us  2741    c 599.70

- g_s_constants.png:

  s005: FW_S_CONSTANT = 0.050 instead of default 0.100
                  fp    17 fn     4 uh  1046 us  3516    c 630.20
  s015: FW_S_CONSTANT = 0.150 instead of default 0.100
                  fp    37 fn     7 uh   705 us  2188    c 666.30

  These are interesting!  To remind you -- the S constant is the strength
  given to the background (prior) belief relative to the learned data; if S
  is nearer to 0, then the learned data is trusted more strongly.

  The fact that s005 has a very low FP/FN rate compared to the normal
  results is very attractive.  It does increase the "unsure" rate,
  but in our implementation that's not a big deal -- it just means
  that the message gets a 0 score from BAYES_50.

  I think exploring the low figures for S might be worthwhile.
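To make the effect concrete, here's what Robinson's f(w) = (s*x + n*p(w)) / (s + n) gives for a token seen in just one message with a strongly spammy raw probability, at the three S values tried above (a sketch; x fixed at the default 0.538):

```python
def f_w(n, p_w, s, x=0.538):
    # Robinson's degree-of-belief estimate for a token
    return (s * x + n * p_w) / (s + n)

# a token seen only once, with a near-1.0 raw probability:
# smaller s -> f(w) closer to p(w), i.e. the single sighting is
# trusted more strongly
for s in (0.050, 0.100, 0.150):
    print(s, round(f_w(1, 0.9999, s), 3))
```

This is why s005 sharpens the extremes (fewer FPs/FNs) while s015 pulls more tokens back toward the prior.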


- x05: FW_X_CONSTANT = 0.500 instead of default 0.538
                  fp    22 fn     7 uh   753 us  2774    c 579.70

  Nothing really too exciting about this one.  as expected, FPs go
  down but FNs go up.  I think we might as well stick with the
  normal setting.

- g_mps.png:

  mps02: MIN_PROB_STRENGTH = 0.2 instead of default 0.346
                  fp    33 fn     5 uh   727 us  1913    c 599.00
  mps04: MIN_PROB_STRENGTH = 0.4 instead of default 0.346
                  fp    23 fn     4 uh   836 us  2829    c 600.50

  nothing really too exciting here either.  we could possibly go up to
  require 0.4 as the minimum probability strength, since it seems to have
  the nice effect of lowering FP *and* FN at the expense of a few more
  BAYES_50 hits on the uncertain cases.  But I think tweaking S would be
  a better way to do that.
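For reference, MIN_PROB_STRENGTH selects which tokens take part in the combining at all: only tokens whose probability lies far enough from the neutral 0.5 point are used.  A minimal sketch (function name is mine):

```python
def token_is_significant(f_w, mps=0.346):
    """A token contributes to the final Bayes score only if its
    probability is at least `mps` away from the neutral 0.5 point."""
    return abs(f_w - 0.5) >= mps
```

Raising it to 0.4 (the 'mps02'/'mps04' runs bracket the default 0.346) drops weaker tokens, e.g. one at f(w) = 0.85, which the default would keep.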


Overall: the code tweaks we have are still working well.  This is good, as I
was worried that spam had changed enough to make them counterproductive. One
exception is the invisible-tokens stuff, which is having no effect at all,
and that is probably a bug. ;)

I'm going to try a few more values for the S constant, which seems to reduce
FPs and FNs while increasing the BAYES_50 cases.  in my opinion it'd be more
valuable for us at this stage to reduce FPs and FNs, since we're not reliant on
Bayes alone.




