[Bug 6155] generate new scores for 3.3.0 release

bugzilla-daemon Tue, 22 Sep 2009 20:12:58 -0700

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155


--- Comment #42 from Daryl C. W. O'Shea <[email protected]> 2009-09-22 20:12:25 PDT ---
I've uploaded my results, but they don't have bayes enabled.  Why, again,
aren't we reusing bayes results?

I've kicked off another round with bayes enabled (my net-enabled check took
13.4 hours) and I'm waiting on timing to see how long it'll take.  I may have
to set up an SQL server on the cluster to do it in a reasonable amount of time.

In any case, I don't think we have enough message results contributed yet for a
good scoreset.  We have far fewer than we had for 3.2.0, although from a larger
number of contributors.  Is there any chance we might see results from Theo?

(In reply to comment #15)
> Should I bother to continue recruiting more masscheck participants after this
> rescore?

I would.  A larger number of people submitting from *clean* corpora will allow
us to provide updated scores more often.  As it is, the scores I'm generating
now (well, broken right now, but I'll fix it soon) swing quite a bit.  I
suspect it's due to not enough submitters and not enough messages.


(In reply to comment #17)
> > the base ruleset (non-sandbox rules) won't change scores, so this is 
> > important.
> > For nightly masschecks, the only scores affected will be those of sandbox
> > rules.  So only about 1/2 of the ruleset, I'd reckon.
> 
> I am curious, do you remember the original reason for this design decision?

I felt that we didn't have a large enough nightly/weekly corpus to reliably
change all of the scores.  I could generate two versions of the scores... with
and without locking the base set of scores.
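
As a rough illustration of those two variants, here is a minimal sketch.  It is
not the actual rescoring code: the function name, the rule names and the score
values are all made up for the example, and it assumes scores are simply held
as a rule-name-to-score mapping.

```python
def merge_scores(previous, regenerated, sandbox_rules, lock_base=True):
    """Pick the score to publish for every rule in the regenerated set.

    previous      -- dict: rule name -> score from the last release scoreset
    regenerated   -- dict: rule name -> score from the latest nightly/weekly run
    sandbox_rules -- set of rule names that live in the sandbox
    lock_base     -- if True, base (non-sandbox) rules keep their previous score
    """
    merged = {}
    for rule, new_score in regenerated.items():
        is_base = rule not in sandbox_rules
        if lock_base and is_base and rule in previous:
            # Base rule: keep the score derived from the larger release corpus.
            merged[rule] = previous[rule]
        else:
            # Sandbox rule (or locking disabled): take the fresh score.
            merged[rule] = new_score
    return merged


# Hypothetical data, for illustration only.
previous = {"BASE_RULE_A": 2.5, "BASE_RULE_B": 0.8}
regenerated = {"BASE_RULE_A": 1.1, "BASE_RULE_B": 1.9, "T_SANDBOX_RULE": 0.4}
sandbox = {"T_SANDBOX_RULE"}

locked = merge_scores(previous, regenerated, sandbox, lock_base=True)
unlocked = merge_scores(previous, regenerated, sandbox, lock_base=False)
```

With locking on, only the sandbox rule picks up its new score; with locking
off, every rule follows the latest run, which is where a small corpus would
show up as swings across the whole ruleset.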

> Might there be value in making the entire ruleset scores affected by nightly
> masschecks?

I think we need a larger nightly/weekly corpus before we do this.

(In reply to comment #18)
> iirc, the risk is that a small set of corpora (e.g. a few people take a week
> off) could cause the entire ruleset to be skewed incorrectly.  This way at
> least only the most recent (sandbox) rules would be affected, so it's a bit
> safer.

Even when all of the regular contributors submitted their results the corpus
wasn't that large, so I didn't want to throw away the scores based on the much
larger corpus we had for 3.2.0.

> It's also faster to generate the scores, but this isn't so much of an issue
> now, as our main machine is quite beefy...

I can do it either way... cycles weren't an issue.

> There may have been other reasons, too, but I can't find the mails :(

I probably only sent one about the topic.  There are some terse comments in the
commit messages for that code.

(In reply to comment #25)
> Daryl, is there a URL to your weekly scores?

Still a little broken on my end, but:

http://svn.apache.org/viewvc/spamassassin/trunk/rulesrc/scores/

