So, it looks like we need to issue a pre4 with scores set properly and restart.
Also, please look at Bug 4461, which will help folks who have a mixed corpus, with some mail carrying X-Spam-Status and some without. We might even be able to get in a couple of bugs that either already have 3 +1 votes, or nearly do.

For the record, I've attached the IRC discussion from earlier this evening for those who were not in on it.

Michael

[07-Jul-2005 17:30:47] <jmason> so 13% of the rules were zeroed. doh!!
[07-Jul-2005 17:30:48] * quinlan beats head against wall
[07-Jul-2005 17:30:53] * jmason wears paper bag
[07-Jul-2005 17:30:56] <quinlan> harder
[07-Jul-2005 17:31:12] <quinlan> 78 out of 579 rules that are not zeroed
[07-Jul-2005 17:31:17] <quinlan> zeroed as in disabled
[07-Jul-2005 17:31:45] <quinlan> probably more like 70 out of 540, but whatever
[07-Jul-2005 17:32:08] <jmason> that are zeroed, or are not zeroed?
[07-Jul-2005 17:32:32] <quinlan> let me just check the mutable ones
[07-Jul-2005 17:33:53] <quinlan> 78 out of 528
[07-Jul-2005 17:34:01] <quinlan> 15%
[07-Jul-2005 17:35:58] <quinlan> hmmmm
[07-Jul-2005 17:36:29] <quinlan> bear in mind that's 15% of set3 rules that are non-zero in some other set
[07-Jul-2005 17:36:34] <jmason> of course, they were the *crappiest* 15%
[07-Jul-2005 17:36:36] <quinlan> so, this is bad
[07-Jul-2005 17:36:44] <quinlan> crappiest when in bayes+net mode
[07-Jul-2005 17:37:38] <jmason> 15% that were nonzero in other sets. argh, yes, that's not good
[07-Jul-2005 17:37:57] <jmason> how's about an experimental mass-check with all rules enabled, to see how big the diff is?
[07-Jul-2005 17:38:08] <jmason> (on the same subset of the mail corpus, of course)
[07-Jul-2005 17:38:25] <quinlan> someone finished their mass-check ?
[07-Jul-2005 17:38:47] <jmason> yeah, I have
[07-Jul-2005 17:38:57] <quinlan> I nominate jmason
[07-Jul-2005 17:38:58] <cthielen> quinlan, I did and have submitted, but am redoing it
[07-Jul-2005 17:39:15] <quinlan> cthielen: I'd kill it and wait for instructions.
[07-Jul-2005 17:39:43] <quinlan> well, let it finish, but I'mm 94% sure we'll have to restart
[07-Jul-2005 17:39:51] <cthielen> mine completes pretty quickly... i'd experiment but I'm going out of town tomorrow for the weekend
[07-Jul-2005 17:40:09] <quinlan> wasn't there some other problem we glossed over?
[07-Jul-2005 17:40:11] <quinlan> oh yeah
[07-Jul-2005 17:40:15] <jmason> alright, I'll gen a new log
[07-Jul-2005 17:40:18] <quinlan> reuse when X-Spam-Status is not present
[07-Jul-2005 17:40:50] <quinlan> I think that's an easier problem to solve.
[07-Jul-2005 17:40:59] <quinlan> we remove the entire rule-zeroing logic.
[07-Jul-2005 17:41:17] <quinlan> and then we just disable the reuse replacement code when there's no X-Spam-Status
[07-Jul-2005 17:41:24] <quinlan> much slower, but fixes problem
[07-Jul-2005 17:41:50] <quinlan> rule-zeroing in mass-check --reuse, to be specific
[07-Jul-2005 17:43:12] <quinlan> just to ask.... is there an easy way to disable a rule on a per-message basis?
[07-Jul-2005 17:43:29] <quinlan> I'm not touching the scores from mass-check
[07-Jul-2005 17:45:07] <duncf> quinlan: i think the only way is to zero the score on a per-message basis, and i have no idea how we'd do that
[07-Jul-2005 17:45:31] <Herk> copy config
[07-Jul-2005 17:46:25] <pasteling> "quinlan" at 209.204.178.122 pasted "patch to fix mass-check" (39 lines, 1.6K) at http://sial.org/pbot/11606
[07-Jul-2005 17:47:13] <henry> I'm absolutely shattered
[07-Jul-2005 17:47:20] <henry> keep me informed of what's going on
[07-Jul-2005 17:47:24] <henry> good night!
[07-Jul-2005 17:47:45] *** henry has quit IRC
[07-Jul-2005 17:48:33] *** DavidMar has quit IRC
[07-Jul-2005 17:52:05] <jmason> Herk: +1
[07-Jul-2005 17:52:20] <jmason> we have to make --reuse idiot-proof, since I am an idiot
[07-Jul-2005 17:58:06] <Herk> ok how about this
[07-Jul-2005 17:58:20] <Herk> someone on the fly
[07-Jul-2005 17:58:23] <Herk> somewhat that is
[07-Jul-2005 17:59:05] <Herk> on startup, right after the creation of $spamtest, we call copy_config
[07-Jul-2005 17:59:32] <Herk> then, we do the logic to dump out mass_prefs and call read_scoreonly_config(mass_prefs)
[07-Jul-2005 17:59:40] <Herk> then copy_config for that
[07-Jul-2005 18:00:02] <Herk> then, in wanted, depending on if we have a status line we pick the correct config
[07-Jul-2005 18:02:19] <Herk> probably some logic in there to keep track of which config was currently loaded so you don't have to perform the switch every time
[07-Jul-2005 18:02:39] <jmason> +1
[07-Jul-2005 18:06:19] <jmason> I can't see any problems with that. it'd be slower, but probably a little faster overall given less DNS lookups involved
[07-Jul-2005 18:08:28] <duncf> jmason: im an idiot too
[07-Jul-2005 18:09:19] *** DavidMar has joined #spamassassin
[07-Jul-2005 18:09:59] <Herk> where does mass_prefs get read in?
[07-Jul-2005 18:10:12] <jmason> dunno
[07-Jul-2005 18:11:17] <Herk> oh, nevermind
[07-Jul-2005 18:13:27] *** cthielen has quit IRC
[07-Jul-2005 18:23:48] *** duncf has quit IRC
[07-Jul-2005 18:24:04] <pasteling> "Herk" at 66.143.177.176 pasted "Untested mass-check patch, but this is what I'm thinking" (101 lines, 3K) at http://sial.org/pbot/11612
[07-Jul-2005 18:25:19] <quinlan> back
[07-Jul-2005 18:25:34] <jmason> $reuse_rules_loaded_p needs to be initted
[07-Jul-2005 18:25:45] <Herk> k
[07-Jul-2005 18:25:51] <jmason> other than that, I like it
[07-Jul-2005 18:25:52] <quinlan> Herk is evil
[07-Jul-2005 18:26:13] <Herk> I need something else in there for when not running with opt_reuse, so one other little logic check
[07-Jul-2005 18:26:27] <jmason> btw I'm thinking we should have some kind of magic symbols that 3.1.x or 3.2 can put in X-Spam-Status to indicate what stuf exactly can be reused...
[07-Jul-2005 18:26:44] <quinlan> jmason: no
[07-Jul-2005 18:26:53] <jmason> I know it's inelegant, but the alternative -- just hoping that people had rules enabled -- is too risky right now I think
[07-Jul-2005 18:28:12] <quinlan> this will generate the best scores possible
[07-Jul-2005 18:28:14] <quinlan> fix the bug
[07-Jul-2005 18:28:17] <quinlan> enhance --reuse
[07-Jul-2005 18:28:34] <quinlan> sorry, topic shift
[07-Jul-2005 18:28:48] <quinlan> re: inelegant - just rename rule if it changes massively
[07-Jul-2005 18:29:07] <quinlan> the reuse logic handles incidental renames as well
[07-Jul-2005 18:29:14] <quinlan> you can specify more than one old name
[07-Jul-2005 18:29:35] <jmason> quinlan: yes, but what if I had a broken version of Net::DNS installed for a while between Jan 4 and Mar 20th?
[07-Jul-2005 18:29:54] <quinlan> well, then you reproduce that condition in your real-time mass-check
[07-Jul-2005 18:30:01] <quinlan> which is probably a *good* thing
[07-Jul-2005 18:30:18] <jmason> too much work, and too little idiot-proofing. you expect everyone to remember that?
[07-Jul-2005 18:30:26] <quinlan> NO
[07-Jul-2005 18:30:43] <quinlan> I mean, you reproduce the temporary DNS failure by losing those hits as reuse operates now
[07-Jul-2005 18:30:57] <quinlan> for example, let's say SURBL goes down once a month
[07-Jul-2005 18:31:26] <quinlan> (for a day) ... our network score set should have that day reflected in the generated scores
[07-Jul-2005 18:33:07] <jmason> yes, but let's say it was just some crash or misconfig on *my* end
[07-Jul-2005 18:33:17] <jmason> why should everyone else's scores reflect that?
[07-Jul-2005 18:34:14] <quinlan> incidentalness should be reflected, that's all
[07-Jul-2005 18:34:25] <quinlan> you don't want to optimize around everything working all the time
[07-Jul-2005 18:34:42] <quinlan> we have non-net rules for a reason :-)
[07-Jul-2005 18:35:09] <jmason> ok, but I'm talking in this scenario about no DNS rules at all for 1/3 of my imaginary corpus
[07-Jul-2005 18:36:29] <jmason> hm. well, I could settle for, let's say, just recording in X-Spam-Status if -L is in use, or not
[07-Jul-2005 18:36:47] <jmason> fwiw: I have in the past switched between -L on and off on my spamd server
[07-Jul-2005 18:40:00] * Herk wonders if we should have some sort of reuse=yes or reuse=no line in the mass-check logs
[07-Jul-2005 18:40:59] <jmason> that could work, you know
[07-Jul-2005 18:41:17] <jmason> and various heuristics to determine if it should be reusable, based on local_tests_only()
[07-Jul-2005 18:44:04] <jmason> yeah, that'd work
[07-Jul-2005 19:07:26] <Herk> ok, I'm gonna have to finish it up later, I think it's done, but needs to be tested, should I just attach to 4461 and let y'all test?
[07-Jul-2005 19:08:56] <Herk> @sabug 4461
[07-Jul-2005 19:09:00] <sabot> Herk: SpamAssassin bug #4461: mass-check --reuse cannot deal with previously-unscanned mail Product: Spamassassin, Component: Masses, Severity: major, Assigned to: [email protected], Status: NEW http://bugzilla.spamassassin.org/show_bug.cgi?id=4461
[07-Jul-2005 19:13:21] <jmason> btw mass-check running now with all scores unzeroed
[07-Jul-2005 19:19:49] <quinlan> Herk: sure
[07-Jul-2005 19:19:53] <quinlan> Herk: evil++;
[07-Jul-2005 20:16:40] <quinlan> jmason: I'm actually fed up with 50_scores.cf
[07-Jul-2005 20:16:52] <quinlan> we should have two files: one for development and one for production
[07-Jul-2005 20:17:11] <quinlan> the development one is edited and is the source for the production one, the production one is 100% machine generated by scripts
[07-Jul-2005 22:23:21] <Herk> so, are we planning on restarting mass-checks with zeroed scores?
[07-Jul-2005 22:24:43] <quinlan> unzeroed plus your patch would be optimal
[07-Jul-2005 22:25:16] <quinlan> given our FP rate and release cycle, I think it would pay off to get it right now.
[07-Jul-2005 22:25:35] <Herk> yeah, patch should be good to go, I'll double check and mark for review
[07-Jul-2005 22:25:42] <quinlan> we should try to get this process down pat such that we can re-run more often
[07-Jul-2005 22:25:59] <quinlan> I think splitting 50_scores.cf into two or more files would help a lot
[07-Jul-2005 22:26:04] <Herk> every weekend :)
[07-Jul-2005 22:26:27] <quinlan> maybe once every 3 months would be good for us ;-)
[07-Jul-2005 22:26:45] <quinlan> I hate updating my corpus
[07-Jul-2005 22:26:50] <Herk> we need to document how to run the perceptron a little better
[07-Jul-2005 22:26:54] <quinlan> yes
[07-Jul-2005 22:27:21] <quinlan> 50_scores_gen.cf and 50_scores_src.cf
[07-Jul-2005 22:27:44] <quinlan> except I'd name them 50_scores.cf and 51_perceptron.cf for easier completion
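
For anyone who wasn't following, here is a rough sketch of the config-switching idea Herk describes in the log above: keep two copies of the config, pick one per message depending on whether it has an X-Spam-Status line, and remember which one is loaded so we only switch when we have to. This is an illustration in Python, not the actual mass-check patch (which is Perl). The class name, the rule names NET_RULE and BODY_RULE, and the scores are all made up, and I'm assuming the reuse config is just the full config with the reusable rules zeroed.

```python
class ConfigSwitcher:
    """Hypothetical stand-in for the two saved configs in the proposed patch."""

    def __init__(self, full_scores, reuse_overrides):
        # "full": every rule enabled, used for mail that was never scanned.
        # "reuse": same scores with the score-only overrides applied on top
        # (assumption: reusable rules are zeroed so old hits can be reused).
        self.configs = {
            "full": dict(full_scores),
            "reuse": {**full_scores, **reuse_overrides},
        }
        self.active = "full"   # which config is currently loaded
        self.switches = 0      # how many times we actually had to switch

    def select(self, headers):
        """Return the config to use for a message with the given headers."""
        has_status = any(name.lower() == "x-spam-status" for name in headers)
        wanted = "reuse" if has_status else "full"
        if wanted != self.active:
            # Only pay for the switch when the needed config changes.
            self.active = wanted
            self.switches += 1
        return self.configs[self.active]
```

You would build one ConfigSwitcher at startup and call select() once per message. A run of messages that all have (or all lack) X-Spam-Status costs no switches at all, which is the point of tracking the loaded config.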
