I figured it does something like that, which is probably fine for most of
those rules that don't hit much mail at all.  Then we have stuff that hits
20%+ of ham, like STYLE_GIBBERISH; the rescorer should probably take that
more into account instead of just "crunching numbers".  :-) It's not like
the whole world uses 5 as a baseline, and people might also have all kinds
of local poison pill rules.  8-10 seems quite OK to use, and I remember some
wiki page even recommending that.
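
To make the threshold point concrete, here is a rough sketch. Everything in
it is made up for illustration: the scores, the LOCAL_POISON_PILL rule, the
ham messages and the count_false_positives helper are all hypothetical, and
this is not how the real rescorer works. It only shows that the same rule
scores can give a different FP count depending on the threshold and on which
local rules a site adds.

```python
# Toy illustration only: the scores and messages below are invented,
# and this is NOT the actual rescorer logic.

def count_false_positives(ham_messages, scores, threshold):
    """Count ham messages whose summed rule scores reach the threshold."""
    fps = 0
    for rules_hit in ham_messages:
        total = sum(scores.get(rule, 0.0) for rule in rules_hit)
        if total >= threshold:
            fps += 1
    return fps

# Hypothetical scores; LOCAL_POISON_PILL stands in for a site-local rule.
scores = {
    "STYLE_GIBBERISH": 1.1,
    "AD_PREFS": 1.4,
    "LOCAL_POISON_PILL": 4.0,
}

# Each entry is the list of rules hit by one (made-up) ham message.
ham = [
    ["STYLE_GIBBERISH"],
    ["STYLE_GIBBERISH", "LOCAL_POISON_PILL"],
    ["STYLE_GIBBERISH", "AD_PREFS", "LOCAL_POISON_PILL"],
    [],
]

for threshold in (5.0, 8.0):
    print(threshold, count_false_positives(ham, scores, threshold))
```

With these invented numbers, two of the four ham messages cross a threshold
of 5 and none cross 8.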


On Sun, Jun 16, 2019 at 08:42:57AM +0100, Paul Stead wrote:
> So let's look at the following rule which isn't promotable in QA: 
> [1]https://ruleqa.spamassassin.org/20190615-r1861371-n/URI_WP_HACKED_2/detail
> 
> This has a publish tflag.
> 
> Because of the publish tflag it is included in the active.list
> 
> Because it's in the active.list it is considered for rescoring.
> 
> When it is rescored, the iterative process scores against both ham and spam 
> in several thousand iterations for the rules from the rev# of that day.
> During these iterations the score that came out triggered minimal FPs (ham 
> mail > 5.0) and helped towards the spam score the best.
> 
> The rescore seems to be doing the right thing in my opinion.
> It might show scores for rules that hit more ham than spam on the qa site, 
> but during the check of the corpus the score generated triggered minimal 
> emails hitting FPs.
> 
> 
> Paul
> 
> 
> On Sat, 15 Jun 2019 at 18:06, John Hardin <[2][email protected]> wrote:
> 
>     On Fri, 14 Jun 2019, Henrik K wrote:
> 
>     > PS.  John, all these rules from your sandbox seem to have very broken
>     > scores, could you perhaps add informative scores to
>     > [3]73_sandbox_manual_scores.cf for these?  At least that method should
>     > work 100% for now..
>     >
>     > FROM_IN_TO_AND_SUBJ 2.199
>     > OBFU_TEXT_ATTACH 1.699
>     > MIME_NO_TEXT 1.542
>     > AD_PREFS 1.399
>     > URI_WP_HACKED_2 1.304
>     > STYLE_GIBBERISH 1.111
>     > UC_GIBBERISH_OBFU 1.000
>     > LUCRATIVE 1.000
>     > HEXHASH_WORD 1.000
>     > FROM_WORDY 1.000
>     > AC_HTML_NONSENSE_TAGS 1.000
>     > LONG_HEX_URI 0.896
>     > FROM_PAYPAL_SPOOF 0.727
> 
>     Not all of those are in my sandbox. For example, AC_HTML_NONSENSE_TAGS is
>     in KAM's.
> 
>     I spent some time today (which I did not have yesterday) to review and
>     update the tuning on many of those rules to improve their S/O.
> 
>     I also tried adding scores to [4]73_sandbox_manual_scores.cf for them to
>     suppress the net scores until those changes can be evaluated by the weekly
>     masscheck, but ran into a problem - see SA bug 7721.
> 
>     The tuning should minimize the problem from the stale net scores, so I'm
>     reluctant to alter their global scores, except for AD_PREFS, which is a
>     very simple rule that seems to be falling afoul of a lot of "legitimate"
>     marketing emails (i.e. actually subscribed to) in the masscheck ham
>     corpora and thus can't really be tuned.
> 
> 
>     --
>       John Hardin KA7OHZ                    [5]http://www.impsec.org/~jhardin/
>       [6][email protected]    FALaholic #11174     pgpk -a [7][email protected]
>       key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
>     -----------------------------------------------------------------------
>        Are you a mildly tech-literate politico horrified by the level of
>        ignorance demonstrated by lawmakers gearing up to regulate online
>        technology they don't even begin to grasp? Cool. Now you have a
>        tiny glimpse into a day in the life of a gun owner.   -- Sean Davis
>     -----------------------------------------------------------------------
>       3 days until SWMBO's Birthday
> 
> 
> References:
> 
> [1] https://ruleqa.spamassassin.org/20190615-r1861371-n/URI_WP_HACKED_2/detail
> [2] mailto:[email protected]
> [3] http://73_sandbox_manual_scores.cf/
> [4] http://73_sandbox_manual_scores.cf/
> [5] http://www.impsec.org/~jhardin/
> [6] mailto:[email protected]
> [7] mailto:[email protected]
