Re: [Bug 6155] generate new scores for 3.3.0 release

Justin Mason Thu, 22 Oct 2009 13:08:17 -0700

On Thu, Oct 22, 2009 at 20:35, Adam Katz <[email protected]> wrote:
> Henrik Krohns wrote:
>> I only have to look at my mail logs from today, and I see dozens of legitimate
>> RDNS_NONE hits originating from real people. I'm happy to greylist it at
>> the MTA, but not block directly.
>>
>> As said, it's a site policy. Some people also happily use high-FP BLs. Many
>> people might not report FPs for one reason or another, but it doesn't mean
>> they don't exist. I like to be on the safe side.
>
> The question is what defines "safe" and why is the score pinned to
> 0.1?  Isn't the whole point of the genetic algorithm to determine what
> "safe" value to assign it?  Who's to say that 0.2 isn't safe?  (I
> suppose there's no way to *cap* a GA score rather than just pin it?)


One thing we need to take into account is that some rules are harder
for senders to fix than others.  Whether or not their ISP gives them
rDNS is quite tricky to fix.  The GA can't take that into account, but
we can, by setting a score manually and locking it as non-mutable.

--j.

> SA is a system of probabilities.  We don't define ham as having 0 or
> fewer points.  Again, I cite the masscheck results.  Is 1.7% of the
> ham corpus bad?  What about MIME_HTML_ONLY's 3.7% ham, or
> RCVD_IN_SPAMCOP_BL's 1.3% ham or SUBJ_ALL_CAPS's 1.8%, ...?  All of
> those have GA-generated scores over 0.1.
>
> What about the fact that this only scores 0.8528% corpus overlap for
> ham scoring 4+? (like RDNS_NONE, MIME_HTML_ONLY's 3.7% ham overlap is
> mostly low-scoring ham, with only 1.5625% matching corpus ham at 4+).
>
> Even the latest scoring proposal here has this line:
>
>  score HTML_MESSAGE 2.199 0.838 1.473 0.511
>
> despite HTML_MESSAGE hitting 40.9% of the ham corpus.

agh!  that's a bug.

> Here are some that hit a larger portion of the ham corpus than of the
> spam corpus despite having positive scores in bugzilla attachment 4553
> (the latest scoring proposal) at
> https://issues.apache.org/SpamAssassin/attachment.cgi?id=4553
>
> MIME_QP_LONG_LINE
> FREEMAIL_FROM
> TVD_SPACE_RATIO
> EXTRA_MPART_TYPE
>
> (among others)
>
> These were found by applying this search to the front page at
> http://ruleqa.spamassassin.org (using a firefox regexp search add-on)
>
> /(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w/
>
> In shell (guess whose Bourne scripting is better than his perl?),
>
> elinks -dump http://ruleqa.spamassassin.org/ |perl -ne 'print if
> /(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w|\sMSECS/' |tee
> rules.txt
>
> for rule in `perl -ne 'if (/.*\s([A-Z]+\w*_\w*)/) { s//$1/; print; }'
> <rules.txt`; do grep "^[^#]* $rule " /tmp/50_scores_newest.cf; done
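
For readers who would rather not decode the one-liners above, here is a
minimal Python sketch of the same kind of check. It is an illustration, not
the method used in the thread. It assumes the ruleqa front page lists its
columns as MSECS, SPAM%, HAM%, S/O, RANK, SCORE, NAME, and that the proposed
scores file uses ordinary "score NAME v1 [v2 v3 v4]" lines. Unlike the regex
above, which keys on the leading digit of the third numeric column, the
sketch compares the two percentage columns directly. The function names, rule
names and numbers are all hypothetical.

```python
import re

# Assumed ruleqa column order: MSECS SPAM% HAM% S/O RANK SCORE NAME
ROW = re.compile(
    r"^\s*(?P<msecs>[\d.]+)\s+(?P<spam>[\d.]+)\s+(?P<ham>[\d.]+)"
    r"\s+(?P<so>[\d.]+)\s+(?P<rank>[\d.]+)\s+(?P<score>[\d.]+)\s+(?P<name>\w+)"
)


def parse_scores(cf_text):
    """Map rule name -> list of scores from 'score NAME v1 [v2 v3 v4]' lines."""
    scores = {}
    for line in cf_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        if len(fields) >= 3 and fields[0] == "score":
            scores[fields[1]] = [float(v) for v in fields[2:]]
    return scores


def suspicious_rules(ruleqa_text, scores):
    """Non-testing rules that hit more ham than spam yet score above zero."""
    found = []
    for line in ruleqa_text.splitlines():
        m = ROW.match(line)
        if not m or m.group("name").startswith("T_"):
            continue  # header line, unparsable row, or T_ testing rule
        name = m.group("name")
        more_ham = float(m.group("ham")) > float(m.group("spam"))
        if more_ham and max(scores.get(name, [0.0])) > 0:
            found.append(name)
    return found


# Made-up sample data, for illustration only.
SAMPLE_RULEQA = """\
  MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME
      0   1.2000   3.4000   0.261    0.40    0.50  EXAMPLE_MORE_HAM
      0  20.0000   0.5000   0.976    0.90    1.00  EXAMPLE_MORE_SPAM
      0   0.1000   2.0000   0.048    0.30    0.01  T_EXAMPLE_TESTING
"""
SAMPLE_SCORES = """\
score EXAMPLE_MORE_HAM 0.5 0.4 0.3 0.2
score EXAMPLE_MORE_SPAM 1.0
"""

if __name__ == "__main__":
    print(suspicious_rules(SAMPLE_RULEQA, parse_scores(SAMPLE_SCORES)))
```

With the sample data only EXAMPLE_MORE_HAM is reported: the second rule hits
more spam than ham, and the third is skipped because of its T_ prefix.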


Could you add a comment to the rescoring bug (bug 6155) noting those
over-high scores?  HTML_MESSAGE at least should NOT be mutable like
that :(

-- 
--j.
