http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376





------- Additional Comments From [EMAIL PROTECTED]  2007-07-06 04:58 -------
(In reply to comment #8)
> Re: Bayes rules.
> No, they should not be immutable. If you want, we can require them to be
> "sane" for some definition of sane. There's no compelling reason for them
> to be the exact values they are.

Possibly not the exact values they are right now.  But I think we'll have to
disagree on whether they need to be immutable; I really would prefer that
they are.  Other developers' thoughts would be welcome here...

> Re: Specifying Score ranges
> This is pretty tricky, I'm not sure what we can do here. The problem is
> that score ranges are inherently non-mathematical, really more of a shot
> in the dark, and there's no real way to evaluate them. Having a different
> corpus form the test set is probably a better real-world test of the
> algorithm, but it also adds a whole lot of luck (I think). If we were
> evaluating *algorithms* rather than submitted sets of scores, we could
> try something like a cross-fold validation but split into folds based on
> corpus as you suggest. I don't know if it would work in practice (or even
> in theory for that matter).

Maybe that would be a good additional test step.  I agree 10fcv (10-fold
cross-validation) is really the best way to check out an algorithm's
workability...
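The "split into folds based on corpus" idea above amounts to
leave-one-corpus-out validation: each fold holds out one whole corpus as the
test set and trains on the rest.  A minimal Python sketch of that split (the
corpus names and message lists are invented for illustration; this is not an
existing masses/ tool):

```python
def corpus_folds(corpora):
    """Yield (held_out_name, train, test); each fold holds out one corpus."""
    for held_out in corpora:
        train = [msg for name, msgs in corpora.items()
                 if name != held_out
                 for msg in msgs]
        yield held_out, train, corpora[held_out]

# Hypothetical example: three corpora of different sizes.
corpora = {
    "corpus-a": ["a1", "a2", "a3"],
    "corpus-b": ["b1", "b2"],
    "corpus-c": ["c1"],
}
for name, train, test in corpus_folds(corpora):
    print(name, "train=%d" % len(train), "test=%d" % len(test))
```

The "whole lot of luck" worry would show up here as high variance between
folds, which this split at least makes visible.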


> Re: Evaluation
> I'm not entirely sure what you're trying to say. Specifying a max FP rate
> and minimizing FNs given that rate is not the same as minimizing TCR. TCR
> builds in a relative cost of FPs and FNs (namely lambda), and is probably
> a simpler criterion. I don't think we'll see really high FPs if we are
> trying to obtain optimal TCR with lambda = 50.
>
> (Remember that in terms of TCR(lambda=50), a score set with FP = 0.4% and
> FN = 0% is equivalent to FP = 0% and FN = 20% (assuming 50-50 ham/spam
> mix).)
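That quoted equivalence is easy to check: TCR weights misclassifications as
lambda*FP + FN, so at lambda = 50 an FP rate of 0.4% costs 50 x 0.4 = 20,
the same as an FN rate of 20% (given the 50-50 mix, so both rates are over
equal-sized corpora).  A quick sketch:

```python
import math

LAMBDA = 50

def weighted_cost(fp_pct, fn_pct, lam=LAMBDA):
    """TCR's error weighting, up to corpus size: lambda*FP + FN (in %)."""
    return lam * fp_pct + fn_pct

# FP=0.4%/FN=0% and FP=0%/FN=20% carry the same weighted cost at lambda=50.
assert math.isclose(weighted_cost(0.4, 0.0), weighted_cost(0.0, 20.0))
```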

Ah, here's another problem with TCR I'd forgotten about: it varies with
the size of the corpora:

: jm 372...; masses/fp-fn-to-tcr -lambda 50 -fn 5 -fp 0.5 -spam 1000 -ham 1000
# TCR(l=50):                  33.500000
: jm 373...; masses/fp-fn-to-tcr -lambda 50 -fn 5 -fp 0.5 -spam 2000 -ham 2000
# TCR(l=50):                  66.833333
: jm 374...; masses/fp-fn-to-tcr -lambda 50 -fn 5 -fp 0.5 -spam 3000 -ham 3000
# TCR(l=50):                  100.166667

(I can't believe I forgot about that!)
I think we should avoid TCR in general; if we can't compare results because
the corpus has changed size, that's a bad thing.  I know it's nice to have a
single figure, but I haven't found a good one yet; nothing that's as
comprehensible or easy to use as FP%/FN%.
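The three runs above are consistent with the usual definition
TCR = Nspam / (lambda * nFP + nFN) if -fn and -fp are absolute counts and
-spam is the count of *caught* spam, so total spam is nspam + nfn -- I
haven't re-checked fp-fn-to-tcr's source, so treat that reading as an
assumption.  Either way, holding the error counts fixed while the corpus
grows scales TCR linearly, which is exactly the incomparability problem:

```python
def tcr(total_spam, n_fp, n_fn, lam=50):
    """Total Cost Ratio: total spam over lambda-weighted error counts."""
    return total_spam / (lam * n_fp + n_fn)

# Mirrors the fp-fn-to-tcr runs above, under the assumption that total
# spam = (-spam) + (-fn); prints 33.500000, 66.833333, 100.166667.
for n_spam in (1000, 2000, 3000):
    print("TCR(l=50): %.6f" % tcr(n_spam + 5, n_fp=0.5, n_fn=5))
```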


> Re: Fire and Forget
> In development of these algorithms, it's much easier to get a process
> down, then automate it later. If we don't want to see/evaluate the
> results before requiring people to have a fully automated system, we
> might be wasting effort.

OK, maybe so.


