http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376
------- Additional Comments From [EMAIL PROTECTED] 2007-07-04 04:32 -------
(In reply to comment #5)
> - BAYES_* are marked as immutable right now (IIRC). This really limits
> optimization in score sets 2 and 3.
I think we do want to keep them immutable, though -- when we made them
mutable, it caused lots of user confusion and FAQs. It makes the rescoring
job a little harder, but dealing with user complaints is a real pain!
> - Score ranges need to be better defined. (Perhaps require that entries fit in
> current score ranges?)
I dunno, I don't think the current ones are all that great anyway ;) I'd bet
people could do better. We've certainly tweaked them repeatedly ourselves
in the past... ;)
I take your point about wanting to block submissions with totally unrestricted
scores though; I tried to do that with what's in the current doc. Can you
suggest wording that goes a bit further?
> If we don't clearly define/restrict score range, the best
> submission will probably be the one with the least restricted scores. Score
> ranges prevent scores from being over-optimized to our data set. Splitting our
> data set into training and test sets doesn't really catch this
> over-optimization, since both are part of our data set that has unique
> characteristics. (I'm sure there are technical terms for this, I just don't
> remember what they are...)
Maybe we'd need to use different submitter accounts for the train and test
sets? So the test set is made up of mail from an entirely different
submitter or set of submitters? We could do that...
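Roughly like this, holding out whole submitters rather than random messages
(just a Python sketch; the corpus format is an assumption):

    # Sketch only -- assumes the corpus is a list of (submitter, message) pairs.
    def split_by_submitter(corpus, test_submitters):
        train, test = [], []
        for submitter, message in corpus:
            (test if submitter in test_submitters else train).append(message)
        return train, test

    # e.g. everything from two designated submitters becomes the test set:
    # train_set, test_set = split_by_submitter(corpus, {"submitter-a", "submitter-b"})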
> - I already have a copy of the test set (if I can find it). Does that make me
> ineligible? :-)
Just don't look at it ;)
Seriously, though, I don't know how we could avoid that problem. By definition,
any member of the SpamAssassin PMC can look at the test set if they want, due
to how the project governance and oversight works. All we can do is work on
trust, I think.
> - By requiring scores in the current format, we are eliminating a whole class
> of scoring systems. For example, suppose I wanted to try a decision tree
> system to detect spam based on SpamAssassin rules (this would obviously work
> very poorly), it would be impossible to convert this into a set of scores.
I think this is acceptable.
I don't think we would want to replace the current scoring system with a more
complex system built around Bayesian probability combining, decision trees, or
anything else without a *lot* more analysis and discussion, whereas this
"Challenge" is more likely to produce something like the perceptron: a drop-in
replacement for the offline rescoring component (I hope).
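To make the "drop-in" shape concrete -- rule hit/miss vectors in, one score
per mutable rule out -- here's a toy Python illustration. This is *not* the
perceptron.c we actually use for rescoring; the names, update rule, and
message format are simplified assumptions:

    # Toy sketch of perceptron-style rescoring, not the real implementation.
    import random

    THRESHOLD = 5.0   # SpamAssassin's usual spam cut-off
    LEARN_RATE = 0.01

    def train_scores(messages, rules, epochs=10):
        """messages: list of (rule_hits, is_spam); rule_hits is a set of rule names."""
        scores = {r: 0.0 for r in rules}
        for _ in range(epochs):
            random.shuffle(messages)  # note: shuffles the caller's list in place
            for rule_hits, is_spam in messages:
                total = sum(scores[r] for r in rule_hits if r in scores)
                predicted_spam = total >= THRESHOLD
                if predicted_spam != is_spam:
                    # nudge every hit rule's score toward the right side of 5.0
                    delta = LEARN_RATE if is_spam else -LEARN_RATE
                    for r in rule_hits:
                        if r in scores:
                            scores[r] += delta
        return scores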
> - Our evaluation criteria is currently undefined. We need a clear, single
> measurement to decide on a winner. (In our research, we used TCR on the test
> set with lambda = 50 as our "goal" criteria.) Depending on how/if we resolve
> the previous point, we need to set a threshold value (for example 5.0) as our
> sole test point.
Well, I had a thought during the 3.2.0 scoring. We actually have two
constraints:
- results have to stay below a certain FP% rate (around 0.4%) if at all
possible (we wound up going over this for set 0, but we had no
alternative :( )
- and as low an FN% rate as possible, given the above.
So in other words, that's two constraints: TCR *and* a cut-off point for the
FP% rate. Does that sound OK?
And for purposes of the challenge -- obviously, it'd need to be better than
what we currently get, too. ;)
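In code, that combined criterion could look something like the Python sketch
below (the 5.0 threshold, the input format, and the function name are just
assumptions for illustration):

    # Sketch of the evaluation criteria discussed above.
    LAMBDA = 50        # cost factor for TCR, as in the research mentioned above
    FP_CUTOFF = 0.4    # maximum acceptable FP%, roughly the 3.2.0 target
    THRESHOLD = 5.0    # example spam/ham cut-off score

    def evaluate(results):
        """results: iterable of (score, is_spam) tuples for the test set."""
        n_spam = n_ham = fp = fn = 0
        for score, is_spam in results:
            flagged = score >= THRESHOLD
            if is_spam:
                n_spam += 1
                if not flagged:
                    fn += 1
            else:
                n_ham += 1
                if flagged:
                    fp += 1
        fp_rate = 100.0 * fp / n_ham if n_ham else 0.0
        fn_rate = 100.0 * fn / n_spam if n_spam else 0.0
        # TCR = N_spam / (lambda * FP + FN); higher is better
        cost = LAMBDA * fp + fn
        tcr = n_spam / cost if cost else float("inf")
        return {"FP%": fp_rate, "FN%": fn_rate, "TCR": tcr,
                "meets_fp_cutoff": fp_rate <= FP_CUTOFF}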
> - Do you think people are actually going to be interested in this enough in
> order to devote a good chunk of time toward it? I hope so...
It'd make a good project, I think! I've received several emails in the
past about the idea, including Gordon's most recently, and it'd be a natural
way for machine-learning students to get Google Summer of Code funding.
> Makes me think I should have submitted a talk to ApacheCon... it'd be a great
> way to kick off this contest.
yeah definitely...
> Oh, also, should we have any requirements on runtime? How automated the
> process needs to be when it is submitted. Our experiments in LR, for example,
> contain a fairly time consuming manual step right now :-) This sort of thing
> could probably be worked out after we select a system, but guidelines might
> be good here.
Good point. It'd need to be "fire and forget" automated, as much as the GA is
right now. Hand-tweaking is hard work, time-consuming, and error-prone
in my experience...