[Bug 5376] RFE: generate a "SpamAssassin Challenge" score-generation test

2022-03-06 Thread bugzilla-daemon
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=5376

Henrik Krohns  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 CC||apa...@hege.li
 Resolution|--- |WONTFIX

--- Comment #24 from Henrik Krohns  ---
Closing ancient stale bug.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-08-22 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

--- Additional Comments From [EMAIL PROTECTED]  2007-08-22 04:42 ---
I think we should go ahead with using FP% and FN% -- a double-figure metric --
instead of any single figure metric.  I don't think any of the single-figure
metrics perform well enough in all situations to match FP%/FN%'s expressiveness.



[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-08-14 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

--- Additional Comments From [EMAIL PROTECTED]  2007-08-14 05:32 ---
lam() avoids the problem noted in comment 13, but has another problem: it
doesn't have any concept of an FP being worse than an FN.  This means that e.g.
fp%=1.5 fn%=5 and fp%=5 fn%=1.5 have the same score, whereas the latter is much
worse for our users.



[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-08-14 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

--- Additional Comments From [EMAIL PROTECTED]  2007-08-14 06:03 ---
here's a version of lam() with a lambda calculation...

#!/usr/bin/perl
# http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376#c16
my ($lambda, $fppc, $fnpc, $nspam, $nham) = @ARGV;
my $fprate = ((($fppc * $nham) / 100) + 0.5) /  ($nham + 0.5);
my $fnrate = ((($fnpc * $nspam) / 100) + 0.5) / ($nspam + 0.5);
sub logit { my $p = shift; return log($p / (1-$p)); }
sub invlogit { my $x = shift; return exp($x) / (1 + exp($x)); }
my $llam = invlogit(($lambda * logit($fprate) + logit($fnrate)) / ($lambda + 1));
print "Llam(l=$lambda, fp=$fppc, fn=$fnpc, ns=$nspam, nh=$nham): $llam\n";



some results:

Llam(l=10, fp=1, fn=5, ns=1, nh=1): 0.011653428823489
Llam(l=10, fp=2, fn=5, ns=1, nh=1): 0.0218100293576761
Llam(l=10, fp=5, fn=1, ns=1, nh=1): 0.0433911604792563

so it avoids the problem with original lam().  however:

Llam(l=10, fp=1.5, fn=5, ns=1, nh=1): 0.0168115579465237
Llam(l=10, fp=1.0, fn=20, ns=1, nh=1): 0.0134020941849576

I think this is a problem.  IMO a 20% FN rate/1.0% FP rate should not score
better than 5% FN/1.5% FP.  It's a good drop in the FP rate, sure -- but a
filter with 20% FNs is unusable. :(

so:

  TCR: varies widely based on size of corpora
  F(): good FP rates can mask terrible FN rates
  lam(): treats FPs and FNs as equal, no concept of lambda
  Llam(): again, good FP rates can mask terrible FN rates

we still don't have a good single-figure metric imo.



[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-08-11 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

--- Additional Comments From [EMAIL PROTECTED]  2007-08-11 01:30 ---
Re: divide by 0
That's a computational issue, not a theoretical one. If there are no FPs or FNs,
then it makes sense for TCR (or F) to be infinite.

You just need to remember what TCR is, a ratio between the costs of dealing with
spam without and with a spam filter. 
(http://wiki.spamassassin.org/TotalCostRatio)



[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-08-11 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

--- Additional Comments From [EMAIL PROTECTED]  2007-08-11 04:15 ---
(In reply to comment #19)
 Re: divide by 0
 That's a computational issue, not a theoretical one. If there are no FPs or
 FNs, then it makes sense for TCR (or F) to be infinite.

it may make sense from the POV of TCR as an abstract concept, but it doesn't
make much sense for us to use it if it breaks like that. ;)



[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-08-08 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

--- Additional Comments From [EMAIL PROTECTED]  2007-08-07 23:47 ---
 if FP and FN are 0 -- ie there were no misclassifications [...]
  -- it yields a division by zero

That's an idea! Implement it and spammers will start producing perfectly
classifiable spam in an attempt to crash the filters with divide by zero errors.
We can foil them by leaving in a small bug to make perfection impossible, but
stay as close to perfect as we want!




[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-08-07 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

--- Additional Comments From [EMAIL PROTECTED]  2007-08-07 10:46 ---
(In reply to comment #10)
 I propose the following new
 measurement. Let's call it the Findlay measurement:
 
 F(lambda) = 1 / (FN% + FP% * lambda)
 
 (Actually as defined above it's a function.)
 
 This is exactly equal to TCR on a balanced (50/50) corpus and doesn't have the
 undesirable properties you mentioned.

it's not bad, but it needs work I'm afraid.  For example,
take a look at this:

  fn=3 fp=1.5 f=0.0128205128205128
  fn=18 fp=1.2 f=0.0128205128205128

in other words, if the FP% in this case dropped from 1.5% to 1.2%, that would
be enough to disguise the FN% rate jumping from 3% to *18%*. ;)
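The collision above is easy to check numerically. A minimal sketch, assuming lambda = 50 (the value used for TCR elsewhere in the thread):

```python
# Sketch: checking the masking effect above.
# F(lambda) = 1 / (FN% + FP% * lambda), with lambda = 50 assumed.
def f_measure(fn_pct, fp_pct, lam=50):
    return 1 / (fn_pct + fp_pct * lam)

print(f_measure(3, 1.5))   # 3 + 1.5*50 = 78 -> 0.0128205...
print(f_measure(18, 1.2))  # 18 + 1.2*50 = 78 -> identical score
```

Any (FN%, FP%) pair lying on the same line FN% + 50*FP% = const gets the same F, which is exactly the masking being complained about.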





[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-08-07 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

--- Additional Comments From [EMAIL PROTECTED]  2007-08-07 11:07 ---
That's exactly right, one FP% is worth 50 FN%, if you set lambda = 50. If you
don't like that tradeoff, pick a different value for lambda. :-)

I agree it gets ugly when you have that many FPs...



[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-08-07 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

--- Additional Comments From [EMAIL PROTECTED]  2007-08-07 11:25 ---
I should point out also that where I wrote FN% and FP% I really meant the
proportion of FN/FPs and not percentage, so you need to divide the percentage by
100 before sticking it in the formula (or multiply the end result by 100). You
end up with F=1.28, not F=0.0128.




[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-08-07 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

--- Additional Comments From [EMAIL PROTECTED]  2007-08-07 12:51 ---
Btw, the CEAS 2007 contest used the following metrics:

  Filters will be evaluated using the lam() metric. Lam() calculates the
  average of a filter's false-negative and false-positive rates, but performs
  the calculation in logit space. The exact formula used for the competition
  will be: 
FPrate = (#ham-errors  + 0.5) / (#ham  + 0.5)
FNrate = (#spam-errors + 0.5) / (#spam + 0.5)
lam = invlogit((logit(FPrate) + logit(FNrate)) / 2)
  where 
logit(p) = log(p/(1-p))
invlogit(x) = (exp(x))/(1 + exp(x))
 The winner will be the filter with the lowest lam() score.
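The quoted CEAS formula can be sketched directly. The error and corpus counts below are illustrative, not taken from the thread:

```python
import math

# Sketch of the CEAS 2007 lam() metric quoted above.
def logit(p):
    return math.log(p / (1 - p))

def invlogit(x):
    return math.exp(x) / (1 + math.exp(x))

def lam(ham_errors, n_ham, spam_errors, n_spam):
    fp_rate = (ham_errors + 0.5) / (n_ham + 0.5)
    fn_rate = (spam_errors + 0.5) / (n_spam + 0.5)
    return invlogit((logit(fp_rate) + logit(fn_rate)) / 2)

# 0.1% FP and 1% FN on a 10k/10k corpus: lam lands close to the
# geometric mean of the two rates (~0.0032).
print(lam(10, 10000, 100, 10000))
```

For small rates, logit(p) is approximately log(p), so averaging in logit space behaves almost like taking the geometric mean of the two rates.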


Here is an additional comment from Gordon V. Cormack
(on the [EMAIL PROTECTED] ML):


Lam is equivalent to diagnostic odds ratio which is
used extensively in the diagnostic test literature.

Intuitively, it rewards a fixed-factor improvement
in either the spam or ham misclassification rate by
the same amount.  So improving from .001 to .0009
fp gives the same score gain as from .01 to .009 fn.

Lam is *almost* the same thing as the geometric
mean of the two scores, which is perhaps more
intuitive for most.

Lam is discussed in the TREC 2005 proceedings
  http://plg.uwaterloo.ca/~gvcormac/trecspamtrack05/

and DOR in the medical literature
  http://www.bmj.com/cgi/content/full/323/7305/157

In a nutshell, we are looking for a threshold-
independent measure that roughly captures the
overall discriminative ability of a filter.  It
turns out that lam works pretty well for this.
Better than, say, accuracy, which is wildly
threshold-dependent; optimizing accuracy may
be totally inconsistent with good filtering.
While lam does not explicitly reward low fp
more than low fn, it does not actively
discourage it.

Other measures -- like ROC Area Under the
Curve -- could have been used but that would
have required filters to return scores rather
than categorical ham/spam judgements and we
considered this logistically unrealistic.




[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-08-07 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

--- Additional Comments From [EMAIL PROTECTED]  2007-08-07 13:51 ---
lam sounds promising -- let's look into that.

btw there's another issue with the F(lambda) idea -- if FP and FN
are 0 -- ie there were no misclassifications, which can happen on
small test corpora -- it yields a division by zero.



[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-08-01 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

--- Additional Comments From [EMAIL PROTECTED]  2007-08-01 04:59 ---
here's another machine-learning Challenge --

  http://challenge.spock.com/pages/learn_more

$50k prize on this one.  I doubt we could match that ;)



[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-07-06 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

--- Additional Comments From [EMAIL PROTECTED]  2007-07-06 04:58 ---
(In reply to comment #8)
 Re: Bayes rules.
 No, they should not be immutable. If you want, we can require them to be sane
 for some definition of sane. There's no compelling reason for them to be the
 exact values they are.

Possibly not the exact values they are right now.  But I think we'll have to
agree to disagree on whether they should be immutable; I really would prefer
that they are.  Other developers' thoughts would be welcome here...

 Re: Specifying Score ranges
 This is pretty tricky, I'm not sure what we can do here. The problem is that
 score ranges are inherently non-mathematical, really more of a shot in the
 dark, and there's no real way to evaluate them. Having a different corpus form
 the test set is probably a better real-world test of the algorithm, but it
 also adds a whole lot of luck (I think). If we were evaluating *algorithms*
 rather than submitted sets of scores, we could try something like a cross-fold
 validation but split into folds based on corpus as you suggest. I don't know
 if it would work in practice (or even in theory for that matter).

maybe that would be a good additional test step.  I agree 10fcv is
really the best way to check out an algorithm's workability...


 Re: Evaluation
 I'm not entirely sure what you're trying to say. Specifying a max FP rate and
 minimizing FNs given that rate is not the same as minimizing TCR. TCR builds
 in a relative cost of FPs and FNs (namely lambda), and is probably a simpler
 criterion. I don't think we'll see really high FPs if we are trying to obtain
 optimal TCR with lambda = 50.

 (Remember that in terms of TCR(lambda=50), a score set with FP = 0.4% and
 FN = 0% is equivalent to FP = 0% and FN = 20% (assuming a 50-50 ham/spam
 mix).)

ah, here's another problem with TCR I'd forgotten about.
it varies based on the size of the corpora:

: jm 372...; masses/fp-fn-to-tcr -lambda 50 -fn 5 -fp 0.5 -spam 1000 -ham 1000
# TCR(l=50):  33.50
: jm 373...; masses/fp-fn-to-tcr -lambda 50 -fn 5 -fp 0.5 -spam 2000 -ham 2000
# TCR(l=50):  66.83
: jm 374...; masses/fp-fn-to-tcr -lambda 50 -fn 5 -fp 0.5 -spam 3000 -ham 3000
# TCR(l=50):  100.17

(I can't believe I forgot about that!)
I think we should avoid it in general; if we can't compare results because
the corpus changes size, that's a bad thing.  I know it's nice to have a
single figure, but I haven't found a good one yet; nothing that's as
comprehensible or easy to use as FP%/FN%.
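The size dependence above falls straight out of the plain TCR formula. This sketch assumes fp-fn-to-tcr's -fn/-fp arguments are absolute error counts (the outputs above suggest so, modulo a small smoothing term):

```python
# Sketch: TCR = Nspam / (lambda * nFP + nFN), illustrating how TCR
# grows with corpus size when the absolute error counts stay fixed.
def tcr(n_spam, n_fn, n_fp, lam=50):
    return n_spam / (lam * n_fp + n_fn)

for n_spam in (1000, 2000, 3000):
    # same errors (5 FNs, 0.5 FPs), growing corpus -> growing TCR
    print(n_spam, round(tcr(n_spam, 5, 0.5), 2))  # 33.33, 66.67, 100.0
```

With the error counts held fixed, doubling the corpus doubles the numerator but not the denominator, so TCR values from different-sized corpora cannot be compared directly.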


 Re: Fire and Forget
 In development of these algorithms, it's much easier to get a process down,
 then automate it later. If we don't want to see/evaluate the results before
 requiring people to have a fully automated system, we might be wasting effort.

ok, maybe so.



[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-07-06 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

--- Additional Comments From [EMAIL PROTECTED]  2007-07-06 13:43 ---
Re: Bayes immutability

All I'm really trying to say is that during scoring runs, we should be changing
the BAYES_ scores. We can manually make them sane if necessary, but we need to
change the values periodically to reflect the changing importance of the Bayes
rules relative to other rules. (In our LR research, we would have given BAYES_99
a score of 6 if we could have, so realistically a score of 4.5 would be fair
for BAYES_99. The best way to determine what it should be is using a scoring
mechanism.)

I can agree to disagree.

Re: TCR

TCR = number of spam / (number of fns + lambda * number of fps)

If you remember what TCR represents, it makes sense that TCR depends on the
relative ratio of ham/spam of the corpus. (If you don't,
http://wiki.spamassassin.org/TotalCostRatio -- but the wiki page really
complicates the calculation.) It's still fine for ranking different algorithms on
the same corpus, but if you're upset about it, I propose the following new
measurement. Let's call it the Findlay measurement:

F(lambda) = 1 / (FN% + FP% * lambda)

(Actually as defined above it's a function.)

This is exactly equal to TCR on a balanced (50/50) corpus and doesn't have the
undesirable properties you mentioned.

(Ok, don't call it the Findlay measurement... it's a stupid name, and it's a
fairly trivial derivation...)
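The balanced-corpus equivalence claimed above is a one-liner to verify. The counts here are illustrative (1000 spam / 1000 ham):

```python
# Sketch: F(lambda) = 1 / (FN + FP * lambda), with FN/FP as
# proportions (as clarified later in the thread), equals
# TCR = Nspam / (nFN + lambda * nFP) on a balanced corpus.
def f_measure(fn_prop, fp_prop, lam=50):
    return 1 / (fn_prop + fp_prop * lam)

def tcr(n_spam, n_fn, n_fp, lam=50):
    return n_spam / (n_fn + lam * n_fp)

n = 1000              # balanced: 1000 spam, 1000 ham
n_fn, n_fp = 50, 5    # 5% FN, 0.5% FP
print(f_measure(n_fn / n, n_fp / n))  # ~3.333
print(tcr(n, n_fn, n_fp))             # ~3.333
```

On a balanced corpus nspam = nham = n, so TCR = n / (n*(FN + lambda*FP)) = F; the corpus size cancels, which is exactly the property TCR itself lacks.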



[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-07-04 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

--- Additional Comments From [EMAIL PROTECTED]  2007-07-04 00:25 ---
Unfortunately it's been a while since I've looked at this stuff. (Actually, it's
been like 3 months... which is hardly a while, but it's been a busy 3 months...)

In no particular order:

- BAYES_* are marked as immutable right now (IIRC). This really limits
optimization in score sets 2 and 3.

- Score ranges need to be better defined. (Perhaps require that entries fit in
current score ranges?) If we don't clearly define/restrict score range, the best
submission will probably be the one with the least restricted scores. Score
ranges prevent scores from being over-optimized to our data set. Splitting our
data set into training and test sets doesn't really catch this
over-optimization, since both are part of our data set that has unique
characteristics. (I'm sure there are technical terms for this, I just don't
remember what they are...)

- I already have a copy of the test set (if I can find it). Does that make me
ineligible? :-)

- By requiring scores in the current format, we are eliminating a whole class of
scoring systems. For example, suppose I wanted to try a decision-tree system to
detect spam based on SpamAssassin rules (this would obviously work very
poorly); it would be impossible to convert this into a set of scores.
  - The LR experiments Steve and I did relied on a logistic decision rule
(i.e. a message is spam if 1 / (1 + exp(-(scores . rule_hits))) > probability
threshold). This is easy to convert into traditional SpamAssassin scores using
algebra, but other systems may not be.
  - If we scrap the requirements for output to be in terms of current
SpamAssassin scores, our score ranges problem becomes more significant -- score
ranges don't mean anything if we're not talking about traditional SpamAssassin
scores.
  - Ask me if this isn't clear -- it's tricky to explain.
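The algebra mentioned above is simple enough to sketch. Everything concrete here (rule names, weights, probability threshold) is invented for illustration:

```python
import math

# Sketch: rewriting a logistic decision rule as additive
# SpamAssassin-style scores.
# sigmoid(w . x) > p   is equivalent to   w . x > logit(p),
# so rescaling the weights maps the cutoff onto the usual 5.0 threshold.
weights = {"BAYES_99": 3.5, "URIBL_BLACK": 1.7, "HTML_ONLY": 0.1}
p_threshold = 0.9

logit_cutoff = math.log(p_threshold / (1 - p_threshold))  # ~2.197
scale = 5.0 / logit_cutoff

scores = {rule: w * scale for rule, w in weights.items()}
# A message is now classified spam iff its hit rules' scores sum to
# more than 5.0 -- exactly the original probability-threshold test.
print(scores)
```

The monotone transform preserves every classification decision, which is why a logistic model converts cleanly while a decision tree (whose output isn't a sum of per-rule terms) does not.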

- Our evaluation criterion is currently undefined. We need a clear, single
measurement to decide on a winner. (In our research, we used TCR on the test set
with lambda = 50 as our goal criterion.) Depending on how/if we resolve the
previous point, we need to set a threshold value (for example 5.0) as our sole
test point.

- Do you think people are actually going to be interested enough in this to
devote a good chunk of time to it? I hope so...

Makes me think I should have submitted a talk to ApacheCon... it'd be a great
way to kick off this contest.



[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-07-04 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376


--- Additional Comments From [EMAIL PROTECTED]  2007-07-04 20:18 ---
Re: Bayes rules.
No, they should not be immutable. If you want, we can require them to be sane
for some definition of sane. There's no compelling reason for them to be the
exact values they are.

Re: Specifying Score ranges
This is pretty tricky, I'm not sure what we can do here. The problem is that
score ranges are inherently non-mathematical, really more of a shot in the dark,
and there's no real way to evaluate them. Having a different corpus form the
test set is probably a better real-world test of the algorithm, but it also adds
a whole lot of luck (I think). If we were evaluating *algorithms* rather than
submitted sets of scores, we could try something like a cross-fold validation
but split into folds based on corpus as you suggest. I don't know if it would
work in practice (or even in theory for that matter).

Re: Evaluation
I'm not entirely sure what you're trying to say. Specifying a max FP rate and
minimizing FNs given that rate is not the same as minimizing TCR. TCR builds in
a relative cost of FPs and FNs (namely lambda), and is probably a simpler
criterion. I don't think we'll see really high FPs if we are trying to obtain
optimal TCR with lambda = 50.

(Remember that in terms of TCR(lambda=50), a score set with FP = 0.4% and FN =
0% is equivalent to FP = 0% and FN = 20% (assuming 50-50 ham/spam mix).)
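That equivalence is quick to confirm with the TCR formula, on an assumed 50-50 corpus of 1000 spam / 1000 ham:

```python
# Sketch: checking the TCR(lambda=50) equivalence claimed above.
def tcr(n_spam, n_fn, n_fp, lam=50):
    return n_spam / (n_fn + lam * n_fp)

n = 1000
a = tcr(n, n_fn=0,   n_fp=4)    # FP = 0.4% of 1000 ham, FN = 0%
b = tcr(n, n_fn=200, n_fp=0)    # FN = 20% of 1000 spam, FP = 0%
print(a, b)  # both 5.0
```

At lambda = 50, four FPs cost as much as 200 FNs, so the two score sets land on the same TCR.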

Re: Fire and Forget
In development of these algorithms, it's much easier to get a process down,
then
automate it later. If we don't want to see/evaluate the results before requiring
people to have a fully automated system, we might be wasting effort.



[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-07-03 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

--- Additional Comments From [EMAIL PROTECTED]  2007-07-03 07:40 ---
ok, I wrote up a spec of what we currently require:

http://wiki.apache.org/spamassassin/SpamAssassinChallenge

does this look like a reasonable idea?  Duncan?



[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-07-03 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

--- Additional Comments From [EMAIL PROTECTED]  2007-07-03 08:27 ---
I think our current "generate scores once a millennium" method is not really
functional for several reasons, but mainly that it takes too long between
updates and the scores are overly tuned for our messages and not necessarily
what other people see (despite having 42 rsync accounts, we have few (8 (but
really more like 6.5) at current count) people doing nightly runs, and I'm ~50%
of the total results).

Also, the last thing I saw about the perceptron, and why it didn't work for 3.2,
was that it needed more diverse data to score with.  Perhaps we just need better
rules?

Anyway, coming up with a new scoring algorithm isn't bad, I just don't think
it's going to make a lot of difference without other changes as well, such as
more frequent runs, more corpus runs/results from the masses, and/or a way for
individual locations to easily update their own scores.



[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-07-03 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

--- Additional Comments From [EMAIL PROTECTED]  2007-07-03 08:23 ---
Should there be something in there about code that can be used without any sort
of patent issues and available under the Apache License?



[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-07-03 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

--- Additional Comments From [EMAIL PROTECTED]  2007-07-03 08:43 ---
(In reply to comment #3)
 I think our current "generate scores once a millennium" method is not really
 functional for several reasons, but mainly that it takes too long between
 updates and the scores are overly tuned for our messages and not necessarily
 what other people see (despite having 42 rsync accounts, we have few (8 (but
 really more like 6.5) at current count) people doing nightly runs, and I'm
 ~50% of the total results).

 Also, the last thing I saw about the perceptron, and why it didn't work for
 3.2, was that it needed more diverse data to score with.  Perhaps we just
 need better rules?

well, the GA did pretty well, so we came to the conclusion that the data was ok
-- just not the kind of data the perceptron's gradient-descent algorithm did
well with.

 Anyway, coming up with a new scoring algorithm isn't bad, I just don't think
 it's going to make a lot of difference without other changes as well, such as
 more frequent runs, more corpus runs/results from the masses, and/or a way for
 individual locations to easily update their own scores.

sure.  The benefit of this, however, is that we can define the problem and then
*other people* will do it for us ;)

Michael, good point.



[Bug 5376] RFE: generate a SpamAssassin Challenge score-generation test

2007-04-26 Thread bugzilla-daemon
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

[EMAIL PROTECTED] changed:

   What|Removed |Added

 OtherBugsDependingOnThis||4560





