http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376

------- Additional Comments From [EMAIL PROTECTED]  2007-08-14 06:03 -------
here's a version of lam() with a lambda calculation...

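#!/usr/bin/perl
# http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376#c16
# args: lambda, FP%, FN%, number of spam messages, number of ham messages
use strict;
use warnings;
die "usage: $0 lambda fppc fnpc nspam nham\n" unless @ARGV == 5;
my ($lambda, $fppc, $fnpc, $nspam, $nham) = @ARGV;
# the +0.5 terms keep a 0% rate from reaching logit() as exactly 0
my $fprate = ((($fppc * $nham) / 100) + 0.5) /  ($nham + 0.5);
my $fnrate = ((($fnpc * $nspam) / 100) + 0.5) / ($nspam + 0.5);
sub logit { my $p = shift; return log($p / (1-$p)); }
sub invlogit { my $x = shift; return exp($x) / (1 + exp($x)); }
# average of the two logits, with the FP logit weighted by lambda
my $llam = invlogit (($lambda * logit($fprate) + logit($fnrate)) / ($lambda + 1));
print "Llam(l=$lambda, fp=$fppc, fn=$fnpc, ns=$nspam, nh=$nham): $llam\n";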



some results:

Llam(l=10, fp=1, fn=5, ns=10000, nh=10000): 0.011653428823489
Llam(l=10, fp=2, fn=5, ns=10000, nh=10000): 0.0218100293576761
Llam(l=10, fp=5, fn=1, ns=10000, nh=10000): 0.0433911604792563

so it avoids the problem with original lam().  however:

Llam(l=10, fp=1.5, fn=5, ns=10000, nh=10000): 0.0168115579465237
Llam(l=10, fp=1.0, fn=20, ns=10000, nh=10000): 0.0134020941849576

I think this is a problem.  IMO a 20% FN rate/1.0% FP rate should not score
better (lower) than 5% FN/1.5% FP.  it's a good drop in the FP rate, sure -- but
a filter with 20% FNs is unusable. :(
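
to make the comparison easier to repeat, here's a rough sketch that wraps the same formula in a sub and scores both cases in one run.  the sub name llam() and the variable names are made up for this sketch; the arithmetic is copied from the script above.  with lambda=10 the FP logit gets 10/11 of the weight in the average, which is why a small FP improvement can outweigh a fourfold jump in FNs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub logit    { my $p = shift; return log($p / (1 - $p)); }
sub invlogit { my $x = shift; return exp($x) / (1 + exp($x)); }

# llam() is a hypothetical wrapper name; the body is the formula from
# the script above, taking its inputs as arguments instead of @ARGV.
sub llam {
    my ($lambda, $fppc, $fnpc, $nspam, $nham) = @_;
    my $fprate = ((($fppc * $nham) / 100) + 0.5) / ($nham + 0.5);
    my $fnrate = ((($fnpc * $nspam) / 100) + 0.5) / ($nspam + 0.5);
    return invlogit(($lambda * logit($fprate) + logit($fnrate)) / ($lambda + 1));
}

my $low_fn  = llam(10, 1.5,  5, 10000, 10000);   # 1.5% FP,  5% FN
my $high_fn = llam(10, 1.0, 20, 10000, 10000);   # 1.0% FP, 20% FN

printf "fp=1.5 fn=5 : %.4f\n", $low_fn;
printf "fp=1.0 fn=20: %.4f\n", $high_fn;
print  "the 20% FN filter scores better (lower)\n" if $high_fn < $low_fn;
```

this should reproduce the two figures quoted above (about 0.0168 and 0.0134).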

so:

  TCR: varies widely based on size of corpora
  F(): good FP rates can mask terrible FN rates
  lam(): treats FPs and FNs as equal, no concept of lambda
  Llam(): again, good FP rates can mask terrible FN rates

we still don't have a good single-figure metric imo.
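
to illustrate the lam() line in the summary: a sketch, assuming the original lam() is the plain unweighted average of the two logits, which is what the Llam formula reduces to when lambda=1.  under that assumption, swapping the FP and FN percentages (with equal corpus sizes) gives the same score, while lambda=10 separates the two cases.  llam() is again a made-up wrapper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub logit    { my $p = shift; return log($p / (1 - $p)); }
sub invlogit { my $x = shift; return exp($x) / (1 + exp($x)); }

# Hypothetical wrapper around the Llam formula from the script above.
sub llam {
    my ($lambda, $fppc, $fnpc, $nspam, $nham) = @_;
    my $fprate = ((($fppc * $nham) / 100) + 0.5) / ($nham + 0.5);
    my $fnrate = ((($fnpc * $nspam) / 100) + 0.5) / ($nspam + 0.5);
    return invlogit(($lambda * logit($fprate) + logit($fnrate)) / ($lambda + 1));
}

# lambda=1 (assumed equivalent to the original lam()): FP and FN are
# interchangeable, so 1% FP/5% FN and 5% FP/1% FN score the same.
my $plain_a = llam(1, 1, 5, 10000, 10000);
my $plain_b = llam(1, 5, 1, 10000, 10000);

# lambda=10: the same swap now changes the score.
my $weighted_a = llam(10, 1, 5, 10000, 10000);
my $weighted_b = llam(10, 5, 1, 10000, 10000);

printf "lambda=1 : %.6f vs %.6f\n", $plain_a, $plain_b;
printf "lambda=10: %.6f vs %.6f\n", $weighted_a, $weighted_b;
```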


