https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #135 from Adam Katz <[email protected]> 2009-10-26 16:27:56 UTC ---
Created an attachment (id=4561)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4561)
Checker for rules that match more ham than spam

I've updated my checker to an actual perl script (still uses elinks as I don't feel like learning LWP and then parsing HTML). I've attached the checker, which can be run with custom parameters for a different ruleset, ham threshold, or minimum difference for ham:spam ratio.

Here's the current output, listing all rules that hit 1+% of the ham corpus or that hit 0.05% more of the ham corpus than of the spam corpus.

H^2/S   HAM%     SPAM%    Score in attachment 4558 Rule
331.9   0.3319   0.0010   0                        OBSCURED_EMAIL
117.4   4.8566   0.2009   -0.001                   SPF_HELO_PASS
88.52   5.5735   0.3509   -0.001                   SPF_PASS
85.61   0.2226   0.0026   0.000 2.099 0.001 1.212  MISSING_MIME_HB_SEP
76.18   0.7085   0.0093   0.001 0.001 0.699 0.699  TVD_RCVD_SPACE_BRACKET
66.19   0.2780   0.0042   1.145 1.542 1.912 2.400  FUZZY_CPILL
49.98   1.0676   0.0228   0.001                    MSGID_MULTIPLE_AT
31.82   0.1496   0.0047   1.494 1.699 1.591 1.516  X_IP
21.86   0.1465   0.0067   0                        SUBJECT_FUZZY_TION
20.40   15.6218  11.9604  0.001                    FREEMAIL_FROM
20.00*  40.9055  83.6301  0.001                    HTML_MESSAGE
17.10   0.1710   0        1.222 0.001 0.082 0.476  MIME_BOUND_DIGITS_15
12.95   0.0609   0.0047   0                        HTML_IFRAME_SRC
12.52   0.0714   0.0057   0                        FORGED_IMS_TAGS
11.56   0.0659   0.0057   0.001 0.001 0.605 0.378  HTML_NONELEMENT_30_40
10.83   0.1127   0.0104   0.033 0.001 0.365 0.413  WEIRD_PORT
10.18   0.3494   0.0343   2.205 0.174 1.299 1.806  FRT_SOMA2
9.721   0.8934   0.0919   1.499 0.419 0.904 0.798  MIME_BASE64_BLANKS
8.996   0.2474   0.0275   0.987 0.750 0.943 1.318  CTYPE_001C_B
8.918   0.1525   0.0171   0.001 2.499 0.268 0.516  DRUGS_MUSCLE
8.373   0.0829   0.0099   0.003 0.978 0.100 1.515  TVD_FW_GRAPHIC_NAME_LONG
8.016   0.1956   0.0244   0.001 0.020 0.001 1.799  MIME_BASE64_TEXT
6.850   0.0685   0        0                        HTML_NONELEMENT_40_50
5.404   0.5356   0.0991   0 1.200 0 2.514          SPF_HELO_FAIL
4.237   0.1585   0.0374   2.199 2.199 1.246 2.090  WEIRD_QUOTING
4.159   3.8908   3.6392   0.001                    MIME_QP_LONG_LINE
3.483   0.8570   0.2460   1.799 0.572 1.182 1.138  HTML_IMAGE_RATIO_06
3.219   1.2399   0.4775   1.0                      EXTRA_MPART_TYPE
2.913*  12.1047  50.2891  0 1.1 0 0.7              RDNS_NONE
2.839   0.1164   0.0410   0.001 2.185 1.936 0.476  FRT_SOMA
2.751   0.1172   0.0426   0.1                      ANY_BOUNCE_MESSAGE
2.417   0.6787   0.2808   0.539 0.001 0.332 0.488  MIME_HTML_MOSTLY
2.370   0.1010   0.0426   0.1                      BOUNCE_MESSAGE
2.078   0.5534   0.2663   1.899 0.496 0.950 0.445  HTML_IMAGE_RATIO_08
1.899   1.2077   0.7677   0.001                    TVD_SPACE_RATIO
1.726   0.3227   0.1869   0.023 0.887 0.000 0.417  UPPERCASE_50_75
1.517   0.9658   0.6364   2.801 2.080 1.780 3.387  DATE_IN_PAST_96_XX
1.269   0.4224   0.3327   0.000 0.001 0.264 0.001  HTML_FONT_SIZE_LARGE
1.151   0.5492   0.4770   2.260 0.742 1.199 0.640  MPART_ALT_DIFF
0.913*  1.8488   3.7425   1.154 1.677 1.198 1.453  SUBJ_ALL_CAPS
0.703*  1.3317   2.5216   0.001                    UNPARSEABLE_RELAY
0.278*  3.7480   50.4848  2.199 0.955 1.215 0.549  MIME_HTML_ONLY
0.121*  1.2540   12.9472  0 1.322 0 1.237          RCVD_IN_BL_SPAMCOP_NET

(Anything asterisked is included because it matched >1% of the ham corpus but matched a larger percent of the spam corpus while everything else matched a larger percent of the ham corpus than the spam corpus.)

Mark's fixes solved the immediate issues raised earlier, so I decided to order this by the ratio of percentage of ham corpus hit to percentage of spam corpus hit, but that under-emphasized the ham hits, so I then multiplied that by the ham percentage again (unless the percent was under 1). It's easy enough to browse for non-zero ham% hits.

Any rule with a ratio over 1.000 is a problem when scored positively unless it is exempted for applying to popular spam patterns that the corpus is known to lack. For completeness, this list includes all tests that hit at least 1% of the ham corpus (thus the presence of HTML_MESSAGE, RDNS_NONE, and the four tests with ratios under 1.0).
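
The ordering and inclusion rules described above can be sketched roughly as follows. This is not the attached Perl checker, only a minimal Python illustration; the function names are hypothetical, and the 0.01 stand-in used when a rule hits no spam at all is an assumption inferred from the table rows with SPAM% of 0 (e.g. 0.1710 ham giving 17.10), not something the comment states.

```python
def h2s_score(ham_pct, spam_pct, zero_spam_stand_in=0.01):
    """Ham:spam ratio, multiplied by ham% again when ham% is at least 1.

    zero_spam_stand_in is an ASSUMPTION (inferred from the table) used
    only to avoid dividing by zero when a rule hit no spam.
    """
    denom = spam_pct if spam_pct > 0 else zero_spam_stand_in
    ratio = ham_pct / denom
    if ham_pct >= 1.0:
        # Re-weight by ham% so widely hit ham rules sort higher.
        ratio *= ham_pct
    return ratio


def is_listed(ham_pct, spam_pct, ham_threshold=1.0, min_diff=0.05):
    """A rule is listed if it hits 1+% of ham, or hits ham at least
    0.05 percentage points more often than spam."""
    return ham_pct >= ham_threshold or (ham_pct - spam_pct) >= min_diff


def is_asterisked(ham_pct, spam_pct, ham_threshold=1.0):
    """Asterisked rows pass the ham threshold but hit more spam than ham."""
    return ham_pct >= ham_threshold and spam_pct > ham_pct


if __name__ == "__main__":
    # (rule, ham%, spam%) taken from the table above
    rows = [
        ("OBSCURED_EMAIL", 0.3319, 0.0010),
        ("SPF_HELO_PASS", 4.8566, 0.2009),
        ("HTML_MESSAGE", 40.9055, 83.6301),
        ("MIME_BOUND_DIGITS_15", 0.1710, 0.0),
    ]
    for rule, ham, spam in sorted(rows, key=lambda r: -h2s_score(r[1], r[2])):
        mark = "*" if is_asterisked(ham, spam) else ""
        print(f"{h2s_score(ham, spam):.4g}{mark}\t{ham}\t{spam}\t{rule}")
```

Fed the HAM% and SPAM% columns from the table, this reproduces the H^2/S column to within rounding of the published percentages (for instance about 331.9 for OBSCURED_EMAIL and about 117.4 for SPF_HELO_PASS).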
