https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #135 from Adam Katz <[email protected]> 2009-10-26 16:27:56 UTC ---
Created an attachment (id=4561)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4561)
Checker for rules that match more ham than spam

I've rewritten my checker as an actual Perl script (it still uses elinks, as I
don't feel like learning LWP and then parsing HTML).  The attached checker can
be run with custom parameters for a different ruleset, ham threshold, or
minimum ham-over-spam difference.  Here's the current output, listing all
rules that hit at least 1% of the ham corpus or that hit at least 0.05
percentage points more of the ham corpus than of the spam corpus.

H^2/S    HAM%    SPAM%    Score in attachment 4558   Rule
331.9    0.3319  0.0010   0                          OBSCURED_EMAIL
117.4    4.8566  0.2009   -0.001                     SPF_HELO_PASS
88.52    5.5735  0.3509   -0.001                     SPF_PASS
85.61    0.2226  0.0026   0.000 2.099 0.001 1.212    MISSING_MIME_HB_SEP
76.18    0.7085  0.0093   0.001 0.001 0.699 0.699    TVD_RCVD_SPACE_BRACKET
66.19    0.2780  0.0042   1.145 1.542 1.912 2.400    FUZZY_CPILL
49.98    1.0676  0.0228   0.001                      MSGID_MULTIPLE_AT
31.82    0.1496  0.0047   1.494 1.699 1.591 1.516    X_IP
21.86    0.1465  0.0067   0                          SUBJECT_FUZZY_TION
20.40   15.6218 11.9604   0.001                      FREEMAIL_FROM
20.00*  40.9055 83.6301   0.001                      HTML_MESSAGE
17.10    0.1710  0        1.222 0.001 0.082 0.476    MIME_BOUND_DIGITS_15
12.95    0.0609  0.0047   0                          HTML_IFRAME_SRC
12.52    0.0714  0.0057   0                          FORGED_IMS_TAGS
11.56    0.0659  0.0057   0.001 0.001 0.605 0.378    HTML_NONELEMENT_30_40
10.83    0.1127  0.0104   0.033 0.001 0.365 0.413    WEIRD_PORT
10.18    0.3494  0.0343   2.205 0.174 1.299 1.806    FRT_SOMA2
9.721    0.8934  0.0919   1.499 0.419 0.904 0.798    MIME_BASE64_BLANKS
8.996    0.2474  0.0275   0.987 0.750 0.943 1.318    CTYPE_001C_B
8.918    0.1525  0.0171   0.001 2.499 0.268 0.516    DRUGS_MUSCLE
8.373    0.0829  0.0099   0.003 0.978 0.100 1.515    TVD_FW_GRAPHIC_NAME_LONG
8.016    0.1956  0.0244   0.001 0.020 0.001 1.799    MIME_BASE64_TEXT
6.850    0.0685  0        0                          HTML_NONELEMENT_40_50
5.404    0.5356  0.0991   0 1.200 0 2.514            SPF_HELO_FAIL
4.237    0.1585  0.0374   2.199 2.199 1.246 2.090    WEIRD_QUOTING
4.159    3.8908  3.6392   0.001                      MIME_QP_LONG_LINE
3.483    0.8570  0.2460   1.799 0.572 1.182 1.138    HTML_IMAGE_RATIO_06
3.219    1.2399  0.4775   1.0                        EXTRA_MPART_TYPE
2.913*  12.1047 50.2891   0 1.1 0 0.7                RDNS_NONE
2.839    0.1164  0.0410   0.001 2.185 1.936 0.476    FRT_SOMA
2.751    0.1172  0.0426   0.1                        ANY_BOUNCE_MESSAGE
2.417    0.6787  0.2808   0.539 0.001 0.332 0.488    MIME_HTML_MOSTLY
2.370    0.1010  0.0426   0.1                        BOUNCE_MESSAGE
2.078    0.5534  0.2663   1.899 0.496 0.950 0.445    HTML_IMAGE_RATIO_08
1.899    1.2077  0.7677   0.001                      TVD_SPACE_RATIO
1.726    0.3227  0.1869   0.023 0.887 0.000 0.417    UPPERCASE_50_75
1.517    0.9658  0.6364   2.801 2.080 1.780 3.387    DATE_IN_PAST_96_XX
1.269    0.4224  0.3327   0.000 0.001 0.264 0.001    HTML_FONT_SIZE_LARGE
1.151    0.5492  0.4770   2.260 0.742 1.199 0.640    MPART_ALT_DIFF
0.913*   1.8488  3.7425   1.154 1.677 1.198 1.453    SUBJ_ALL_CAPS
0.703*   1.3317  2.5216   0.001                      UNPARSEABLE_RELAY
0.278*   3.7480 50.4848   2.199 0.955 1.215 0.549    MIME_HTML_ONLY
0.121*   1.2540 12.9472   0 1.322 0 1.237            RCVD_IN_BL_SPAMCOP_NET

(Asterisked rules are included only because they matched more than 1% of the
ham corpus; each of them matched a larger percentage of the spam corpus.
Every other rule matched a larger percentage of the ham corpus than of the
spam corpus.)
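
For anyone who wants to reproduce the selection without the attachment, here is
a minimal Python sketch of the inclusion test and the asterisk flag as described
above.  It is not the attached Perl checker: the function name and parameter
names are hypothetical, and reading the minimum difference as percentage points
is my assumption.

```python
def select_rule(ham_pct, spam_pct, ham_threshold=1.0, min_difference=0.05):
    """Return (listed, asterisked) for one rule.

    ham_pct and spam_pct are percentages of each corpus hit by the rule.
    Assumption: min_difference is measured in percentage points.
    """
    hits_enough_ham = ham_pct >= ham_threshold
    favors_ham = (ham_pct - spam_pct) >= min_difference
    listed = hits_enough_ham or favors_ham
    # Asterisk: listed only because of the ham threshold, while the rule
    # actually hits a larger share of the spam corpus.
    asterisked = hits_enough_ham and spam_pct > ham_pct
    return listed, asterisked


# Rows taken from the table above: (rule, HAM%, SPAM%)
for name, ham, spam in [
    ("HTML_MESSAGE", 40.9055, 83.6301),
    ("FREEMAIL_FROM", 15.6218, 11.9604),
    ("HTML_IFRAME_SRC", 0.0609, 0.0047),
]:
    listed, asterisked = select_rule(ham, spam)
    print(name, "listed" if listed else "skipped", "*" if asterisked else "")
```

Both thresholds are keyword arguments so they can be varied the same way the
checker's custom parameters can.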

Mark's fixes solved the immediate issues raised earlier, so I decided to order
this by the ratio of the percentage of the ham corpus hit to the percentage of
the spam corpus hit.  That under-emphasized the ham hits, so I then multiplied
the ratio by the ham percentage again (unless that percentage was under 1);
the result is the H^2/S column.  It's easy enough to browse for non-zero ham%
hits.
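
The ordering value can be sketched in a few lines of Python.  This is an
illustration of the arithmetic described above, not the attached Perl script;
the function and parameter names are hypothetical.  How a SPAM% of 0 is handled
is not stated here, so the `zero_spam_substitute` parameter is purely my
assumption: a value of 0.01 happens to reproduce the two zero-spam rows in the
table, but that is an inference from the numbers.

```python
def ham_weighted_ratio(ham_pct, spam_pct, zero_spam_substitute=0.01):
    """Ham:spam ratio, multiplied by ham% again when ham% is at least 1.

    Assumption: a spam_pct of exactly 0 is replaced by zero_spam_substitute
    to avoid dividing by zero.
    """
    denominator = spam_pct if spam_pct > 0 else zero_spam_substitute
    value = ham_pct / denominator
    if ham_pct >= 1.0:
        value *= ham_pct  # emphasize rules that hit a lot of ham
    return value


# Rows taken from the table above: (rule, HAM%, SPAM%)
rows = [
    ("MIME_HTML_ONLY", 3.7480, 50.4848),
    ("SPF_HELO_PASS", 4.8566, 0.2009),
    ("OBSCURED_EMAIL", 0.3319, 0.0010),
    ("HTML_MESSAGE", 40.9055, 83.6301),
]

# Sort descending by the ordering value, as in the table.
ranked = sorted(rows, key=lambda r: ham_weighted_ratio(r[1], r[2]), reverse=True)
for name, ham, spam in ranked:
    print(f"{ham_weighted_ratio(ham, spam):8.3f} {ham:8.4f} {spam:8.4f}  {name}")
```

Because the extra multiplication only applies at 1% ham and above, a rule like
HTML_MESSAGE can rank high even though it hits far more spam than ham, which is
why the asterisks are needed.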

Any rule with a ham:spam ratio over 1.000 is a problem when scored positively,
unless it is exempted because it targets popular spam patterns that the corpus
is known to lack.  For completeness, this list also includes all tests that
hit at least 1% of the ham corpus (thus the presence of HTML_MESSAGE,
RDNS_NONE, and the four tests with ratios under 1.0).
