Hi Duncan,

On Wed, Jul 20, 2005 at 12:41:48AM -0400, Duncan Findlay wrote:
>
> I think the first point is the bigger one. Ultimately, Dan's sandbox
> proposal may solve part of the "not enough rules" problem by making it
> easier for people to contribute rules. But I'd like to hear from
> potential rule submitters -- would this be a step in the right
> direction? Is this something that you would be on board with? Would
> you be more inclined to contribute rules?
Maybe a bit off-topic; on the other hand... see below.

1) What I miss most is a transparent set of statistics for every rule.
I'd like to know:

 - percentage of false positives
 - percentage of false negatives
 - percentage of true positives
 - percentage of true negatives
 - number of mails checked for the results above
 - standard deviation of the percentages above

These numbers should be available per corpus for different regions and
languages, i.e. Europe/English, Europe/German, since there are big
differences in how effective the rules are.  (A rough sketch of how such
statistics could be computed is in the P.S. below.)

2) Detection of redundancy or linear dependence: is my new rule covered
or disabled by another rule, or does it affect existing rules?  This
could be detected with a MassCheck.  (See the second sketch in the P.S.)

3) As Loren said before, new rules become useless once they are posted
on the list.

If you implement 1), this could give strong feedback and motivation to
the rule contributors.  If you collect the statistics automatically from
registered (trusted) servers, you would not even have to run your own
mass checks!  The benefit for the user: very fast feedback about which
rules are actually useful.

About 2): I sometimes wonder whether my rules are really useful.  This
could be an indicator.  Since I don't want to commit useless rules, this
may help, even if it's only a small point.

About 3): this is a very problematic one.  The only way I can see
(while keeping the source open) is to react very fast, very flexibly and
very individually.  This is a "goto 1)".  If I have a big pool of rules
from which I can decide for myself which ones to take and which not --
based on real facts, not on guessing -- that would be a great
improvement.  My idea is to send a false negative to a reference server,
see which rules match (even very new and little-tested ones), look at
the statistics and decide whether to include them -- or, if no rule
matches, to provide one.  For each rule, the author should store a set
of matching spam mails so other rules can be cross-checked for linear
dependencies.

Sadly, the scoring model currently in use is not helpful for this
approach :(  It would be much better to have a real statistical scoring
where I could simply multiply the probabilities of each matching rule to
get a result.  That result would tell me: this is 99% spam, and the
probability of that classification being wrong is 0.3%, based on the
Europe/German corpus.  (A toy sketch of this kind of combination is the
last one in the P.S.)

The statistical scoring could be calculated directly and quickly from
the feedback in 1) and/or from a MassCheck, and -- don't underestimate
this -- it would make it *much* easier and more accurate to include
external modules like NiX-Spam:

http://www.heise.de/ix/nixspam/
http://www.bonengel.de/index.php?id=7

Even the Bayes classifier would be much easier to score, and you would
no longer need four different score sets for "with/without Bayes" times
"with/without network tests".  Be aware that the number of score sets
doubles with each new class of tests you add under the current scoring
model!

I know the proposed change in scoring would be a really big step, but I
think it is absolutely necessary in order to be prepared for flexible
and fast future development.

--
Regards
Frank
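P.S.: To make 1) a bit more concrete, here is a rough sketch of how the
per-rule numbers could be computed.  The input format is only an
assumption of mine (one line per message: the true class, then the rules
that hit it), not the real mass-check output, and the FN/TN percentages
are simply the complements of the TP/FP ones printed here.

#!/usr/bin/env python3
# Rough sketch only.  Assumed (not real) log format, one message per line:
#   spam RULE_A,RULE_B
#   ham  RULE_C
import sys
import math

msgs = []           # list of (is_spam, set of rules that hit the message)
rules = set()
for line in sys.stdin:
    parts = line.split(None, 1)
    if not parts:
        continue
    hit = {r.strip() for r in parts[1].split(",")} if len(parts) > 1 else set()
    hit.discard("")
    msgs.append((parts[0] == "spam", hit))
    rules.update(hit)

n_spam = sum(1 for is_spam, _ in msgs if is_spam)
n_ham = len(msgs) - n_spam

for rule in sorted(rules):
    tp = sum(1 for is_spam, hit in msgs if is_spam and rule in hit)
    fp = sum(1 for is_spam, hit in msgs if not is_spam and rule in hit)
    tp_rate = tp / n_spam if n_spam else 0.0
    fp_rate = fp / n_ham if n_ham else 0.0
    # binomial standard error of the FP rate, so the percentage can be
    # judged against the size of the corpus it was measured on
    se_fp = math.sqrt(fp_rate * (1.0 - fp_rate) / n_ham) if n_ham else 0.0
    print("%-30s TP %5.1f%%  FP %5.2f%% (+/- %4.2f%%)  mails checked: %d"
          % (rule, 100 * tp_rate, 100 * fp_rate, 100 * se_fp, len(msgs)))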
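For 2), a similarly rough sketch that looks for near-redundant rule
pairs, again using the invented log format above; the 95% threshold is
arbitrary.

# Reads the same assumed "class rule,rule,..." log and reports, for each
# pair of rules, how much of one rule's hits the other already covers;
# close to 100% means the rule adds little on this corpus.
import sys
from collections import defaultdict
from itertools import combinations

hits = defaultdict(set)        # rule name -> set of message ids it hit
for msg_id, line in enumerate(sys.stdin):
    parts = line.split(None, 1)
    if len(parts) < 2:
        continue
    for rule in (r.strip() for r in parts[1].split(",")):
        if rule:
            hits[rule].add(msg_id)

for a, b in combinations(sorted(hits), 2):
    overlap = len(hits[a] & hits[b])
    if not overlap:
        continue
    cover_a = overlap / len(hits[a])   # fraction of a's hits that b also hits
    cover_b = overlap / len(hits[b])
    if max(cover_a, cover_b) >= 0.95:  # arbitrary threshold
        print("%s / %s: %.0f%% of %s covered, %.0f%% of %s covered"
              % (a, b, 100 * cover_a, a, 100 * cover_b, b))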
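And the "multiply the probabilities" scoring I have in mind is
essentially a naive Bayes combination; here is a toy example with
invented rule names and numbers (in practice the per-rule probabilities
would come from the statistics in 1), measured on a corpus such as
Europe/German).

def spamminess(rule_probs):
    """Combine per-rule P(spam | rule hit) values into one probability,
    (naively) assuming the rules are statistically independent."""
    p_spam = p_ham = 1.0
    for p in rule_probs:
        p_spam *= p
        p_ham *= 1.0 - p
    return p_spam / (p_spam + p_ham)

# hypothetical rules that hit one message, with their measured probabilities
hit_probs = {"EXAMPLE_SUBJECT_RULE": 0.93,
             "EXAMPLE_URI_RULE": 0.88,
             "EXAMPLE_BAYES_99": 0.99}
print("P(spam) = %.4f" % spamminess(hit_probs.values()))

With those three probabilities this prints roughly 0.9999, i.e. "99.99%
spam" -- exactly the kind of directly interpretable number I would like
the scoring to produce.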
