I figured it does something like that, which is probably fine for most of those rules that don't hit much mail at all. But then we have rules that hit 20%+ of ham, like STYLE_GIBBERISH; the rescorer should probably take that more into account instead of just "crunching numbers". :-) It's not like the whole world uses 5 as a baseline, and people might also have all kinds of local poison pill rules. A threshold of 8-10 seems quite OK to use, and I remember some wiki page even recommending that.
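To make the baseline point concrete, here is a minimal Python sketch. It is not the actual rescorer: the function name and the ham corpus values are made up for illustration, and the base scores are assumed to already include whatever local rules a site runs. It only counts how many ham messages reach the threshold with and without the rule's score added.

```python
from typing import List, Tuple

# One ham message: (score from all other rules, does the rule under test hit?).
# Hypothetical values; base scores are assumed to include local rules.
HamMessage = Tuple[float, bool]


def false_positives(ham: List[HamMessage], rule_score: float, threshold: float) -> int:
    """Count ham messages whose total score reaches the spam threshold."""
    count = 0
    for base_score, rule_hits in ham:
        total = base_score + (rule_score if rule_hits else 0.0)
        if total >= threshold:
            count += 1
    return count


ham_corpus: List[HamMessage] = [
    (1.0, True),
    (4.5, False),
    (7.2, True),
    (9.1, True),
    (2.0, True),
]

RULE_SCORE = 1.111  # the STYLE_GIBBERISH score from the list quoted below

for threshold in (5.0, 8.0, 10.0):
    without_rule = false_positives(ham_corpus, 0.0, threshold)
    with_rule = false_positives(ham_corpus, RULE_SCORE, threshold)
    print(f"threshold {threshold}: {with_rule - without_rule} extra FP(s) from the rule")
```

With this made-up corpus the rule adds no false positives at 5 but does add one at 8 and one at 10, which is the kind of effect a rescore tuned only against 5 would not see.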
On Sun, Jun 16, 2019 at 08:42:57AM +0100, Paul Stead wrote:
> So let's look at the following rule which isn't promotable in QA:
> [1]https://ruleqa.spamassassin.org/20190615-r1861371-n/URI_WP_HACKED_2/detail
>
> This has a publish tflag.
>
> Because of the publish tflag it is included in the active.list
>
> Because it's in the active.list it is considered for rescoring.
>
> When it is rescored, the iterative process scores against both ham and spam
> in several thousand iterations for the rules from the rev# of that day.
> During these iterations the score that came out triggered minimal FPs (ham
> mail > 5.0) and helped towards the spam score the best.
>
> The rescore seems to be doing the right thing in my opinion.
> It might show scores for rules that hit more ham than spam on the qa site,
> but during the check of the corpus the score generated triggered minimal
> emails hitting FPs.
>
>
> Paul
>
>
> On Sat, 15 Jun 2019 at 18:06, John Hardin <[2][email protected]> wrote:
>
> On Fri, 14 Jun 2019, Henrik K wrote:
>
> > PS. John, all these rules from your sandbox seem to have very broken
> > scores, could you perhaps add informative scores to
> > [3]73_sandbox_manual_scores.cf for these? Atleast that method should work
> > 100% for now..
> >
> > FROM_IN_TO_AND_SUBJ    2.199
> > OBFU_TEXT_ATTACH       1.699
> > MIME_NO_TEXT           1.542
> > AD_PREFS               1.399
> > URI_WP_HACKED_2        1.304
> > STYLE_GIBBERISH        1.111
> > UC_GIBBERISH_OBFU      1.000
> > LUCRATIVE              1.000
> > HEXHASH_WORD           1.000
> > FROM_WORDY             1.000
> > AC_HTML_NONSENSE_TAGS  1.000
> > LONG_HEX_URI           0.896
> > FROM_PAYPAL_SPOOF      0.727
>
> Not all of those are in my sandbox. For example, AC_HTML_NONSENSE_TAGS is
> in KAM's.
>
> I spent some time today (which I did not have yesterday) to review and
> update the tuning on many of those rules to improve their S/O.
>
> I also tried adding scores to [4]73_sandbox_manual_scores.cf for them to
> suppress the net scores until those changes can be evaluated by the weekly
> masscheck, but ran into a problem - see SA bug 7721.
>
> The tuning should minimize the problem from the stale net scores, so I'm
> reluctant to alter their global scores, except for AD_PREFS, which is a
> very simple rule that seems to be falling afoul of a lot of "legitimate"
> marketing emails (i.e. actually subscribed to) in the masscheck ham
> corpora and thus can't really be tuned.
>
>
> --
> John Hardin KA7OHZ [5]http://www.impsec.org/~jhardin/
> [6][email protected] FALaholic #11174 pgpk -a [7][email protected]
> key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
> -----------------------------------------------------------------------
> Are you a mildly tech-literate politico horrified by the level of
> ignorance demonstrated by lawmakers gearing up to regulate online
> technology they don't even begin to grasp? Cool. Now you have a
> tiny glimpse into a day in the life of a gun owner. -- Sean Davis
> -----------------------------------------------------------------------
> 3 days until SWMBO's Birthday
>
>
> References:
>
> [1] https://ruleqa.spamassassin.org/20190615-r1861371-n/URI_WP_HACKED_2/detail
> [2] mailto:[email protected]
> [3] http://73_sandbox_manual_scores.cf/
> [4] http://73_sandbox_manual_scores.cf/
> [5] http://www.impsec.org/~jhardin/
> [6] mailto:[email protected]
> [7] mailto:[email protected]
