Re: [Bug 6155] generate new scores for 3.3.0 release

Adam Katz Thu, 22 Oct 2009 12:36:02 -0700

Henrik Krohns wrote:
> I only have to look at my mail logs from today, and I see dozens of legitimate
> RDNS_NONE hits originating from real people. I'm happy to greylist it at
> MTA, but not block directly.
> 
> As said, it's a site policy. Some people use high FP BLs also happily. Many
> people might not report FPs for one reason or another, but it doesn't mean
> they don't exist. I like to be on the safe side.


The question is what defines "safe" and why is the score pinned to
0.1?  Isn't the whole point of the genetic algorithm to determine what
"safe" value to assign it?  Who's to say that 0.2 isn't safe?  (I
suppose there's no way to *cap* a GA score rather than just pin it?)

SA is a system of probabilities.  We don't define ham as having 0 or
fewer points.  Again, I cite the masscheck results.  Is 1.7% of the
ham corpus bad?  What about MIME_HTML_ONLY's 3.7% ham, or
RCVD_IN_SPAMCOP_BL's 1.3% ham or SUBJ_ALL_CAPS's 1.8%, ...?  All of
those have GA-generated scores over 0.1.

What about the fact that RDNS_NONE overlaps only 0.8528% of the corpus
ham scoring 4+?  (As with RDNS_NONE, MIME_HTML_ONLY's 3.7% ham overlap
is mostly low-scoring ham, with only 1.5625% matching corpus ham at 4+.)

Even the latest scoring proposal here has this line:

  score HTML_MESSAGE 2.199 0.838 1.473 0.511

despite HTML_MESSAGE hitting 40.9% of the ham corpus.
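
For anyone reading that line cold: when a score line carries four
values, each applies to one score set (0: no Bayes, no network tests;
1: network only; 2: Bayes only; 3: both).  A minimal sketch of picking
the value for a given set follows; score_for_set is a hypothetical
helper written for illustration, not part of SpamAssassin.

```sh
# score_for_set LINE IDX
# Prints the score from a "score NAME ..." line that applies to score
# set IDX (0-3).  Hypothetical helper, for illustration only.
score_for_set() {
    printf '%s\n' "$1" | awk -v idx="$2" '
        $1 == "score" {
            n = NF - 2                      # number of score values listed
            if (n == 1)
                print $3                    # a single value covers every set
            else if (n == 4 && idx >= 0 && idx <= 3)
                print $(3 + idx)            # four values: one per score set
        }'
}
```

With the line above, score_for_set 'score HTML_MESSAGE 2.199 0.838 1.473 0.511' 3
prints the Bayes-plus-network value.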

Here are some rules that hit a larger portion of the ham corpus than of
the spam corpus despite having positive scores in bugzilla attachment
4553 (the latest scoring proposal) at
https://issues.apache.org/SpamAssassin/attachment.cgi?id=4553

MIME_QP_LONG_LINE
FREEMAIL_FROM
TVD_SPACE_RATIO
EXTRA_MPART_TYPE

(among others)

These were found by applying this search to the front page at
http://ruleqa.spamassassin.org (using a Firefox regexp search add-on):

/(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w/

In shell (guess whose Bourne scripting is better than his Perl?):

elinks -dump http://ruleqa.spamassassin.org/ \
  | perl -ne 'print if /(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w|\sMSECS/' \
  | tee rules.txt

for rule in `perl -ne 'if (/.*\s([A-Z]+\w*_\w*)/) { s//$1/; print; }' <rules.txt`; do
  grep "^[^#]* $rule " /tmp/50_scores_newest.cf
done
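
As I read it, the regexp only selects non-T_ rows whose third numeric
column starts with a non-zero digit; it does not itself compare ham
against spam.  A rough sketch of doing that comparison directly is
below.  It assumes rules.txt keeps the column order MSECS SPAM% HAM%
S/O RANK SCORE NAME (an assumption about the ruleqa layout), and
ham_heavy_rules is a hypothetical helper name.

```sh
# ham_heavy_rules: read ruleqa-style rows on stdin and print the names
# of non-test rules whose HAM% exceeds their SPAM%.
# Assumed columns: MSECS SPAM% HAM% S/O RANK SCORE NAME
ham_heavy_rules() {
    awk '
        # Skip the header row and any row whose first three fields are not numeric.
        $1 !~ /^[0-9.]+$/ || $2 !~ /^[0-9.]+$/ || $3 !~ /^[0-9.]+$/ { next }
        # Skip test rules, as the (?!T_) lookahead does.
        $7 ~ /^T_/ { next }
        # Keep rules that hit a larger share of ham than of spam.
        ($3 + 0) > ($2 + 0) { print $7 }
    '
}
```

Usage would be along the lines of: ham_heavy_rules <rules.txt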
