Duncan earlier wrote:

>> Masscheck has an interdependency option, although it increases the
>> checking time.  We use it on rules once they seem useful, but not
>> usually in early one-off checking.
>
>I'm not sure what you mean by this. We have an "overlap" script which
>does some of this -- is that what you're talking about?

Yes.  It's useful once we have a seemingly usable rule or set of rules.  They
can be checked to see whether they actually improve on the existing rules.
But it's generally not worth the time when developing rules initially and
whittling down unexpected FPs and the like.


>> This is a very interesting idea that I think needs more exploring in the
>> future.  Any SA server that has a Bayes database potentially has most
>> of the knowledge to be able to participate in Seti-like background
>> processing for determining rule hit ratios.  For that matter, any SA
>> server should be able
>
>Interesting, I agree. I'm not sure this will help at all with new rule
>development, but it would give us interesting data over relative hit
>rates over time.
>It would certainly be lots of work to set up, though. :-(

It wouldn't directly help with new rule development, but it could indicate
rules that can be retired to (at least temporary) limbo, and, if the right
kind of information can be obtained, it might indicate rules whose scores
should change, either to reduce FP chances (or actual FPs) or to improve the
catch rate on certain types of spam.  Possibly the data might occasionally
even suggest areas for new or modified rules, by combining the data with
knowledge of current spam and what the various rules do.

>It would certainly be lots of work to set up, though. :-(

I'm not absolutely sure of the "lots" part.  What it will take is some
concerted thought devoted to what can be gathered, what sysadmins would feel
comfortable with letting a third party (us) gather, and how one might go
about automating the process.

I'm generally envisioning a release component of SA that can be turned on at
site option that would somehow agglomerate hit information over a day or so,
and then at some random/specified time connect to an Apache server and
upload the information, possibly with ftp or the like.  Or maybe even as
text in an email to a special SA address.
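Just to make the shape of the idea concrete, here is a minimal sketch of such a site-option reporter in Python (SA itself is Perl, and every name here -- the hook, the collection address, the report format -- is hypothetical, not an existing SA interface):

```python
from collections import Counter
from email.message import EmailMessage

# Hypothetical daily accumulator: rule name -> hit count at this site.
hits = Counter()

def record_hit(rule_name):
    """Hypothetical hook, called once per rule hit as messages are scanned."""
    hits[rule_name] += 1

def build_report(site_id):
    """Format the day's hit counts as a plain-text email to a collection address."""
    body = "\n".join(f"{rule}\t{count}" for rule, count in sorted(hits.items()))
    msg = EmailMessage()
    msg["To"] = "stats@example.invalid"   # placeholder collection address
    msg["Subject"] = f"SA hit report: {site_id}"
    msg.set_content(body)
    return msg

# Example: a few simulated hits, then the report body.
for rule in ["BAYES_99", "BAYES_99", "RAZOR2_CHECK"]:
    record_hit(rule)
print(build_report("site-001").get_content())
```

The email-as-transport variant is attractive precisely because every SA site already has working outbound mail; the random/specified upload time would just be a cron entry around something like this.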

Likely the information collected would be much like the current hit
statistics now being logged by SA.  It might even be exactly that
information.  Clearly some thought needs to be devoted to what should be
collected that would be of the most use, and this indeed constitutes work.

Beyond that though, the major effort would be writing the SA components that
could collect the information and then send it back to where it can be used,
and at some point some accumulation programs that can collect the individual
reports and boil the statistics down to something manageable.  Just speaking
through my hat, I don't see the collection/reporting component as a major
effort, as the collection part probably mostly already exists, and I'd think
the collection sending part should be moderately simple.

Doing something with the data might be more complex.  But a base-level thing
that just collected the last 24 hours reports and summed the hits per rule
should be fairly simple to create, I would guess.
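That base-level summing really is trivial; assuming each uploaded report boils down to a per-rule hit-count mapping (an assumption about the report format, not a decided design), the whole aggregation step is a few lines:

```python
from collections import Counter

# Hypothetical: each uploaded report parsed into rule name -> hit count.
reports = [
    {"BAYES_99": 120, "RAZOR2_CHECK": 40},
    {"BAYES_99": 75, "SUBJ_ALL_CAPS": 12},
]

def sum_hits(reports):
    """Boil a batch of per-site reports down to total hits per rule."""
    totals = Counter()
    for report in reports:
        totals.update(report)
    return totals

print(sum_hits(reports))  # totals per rule across all reports
```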

Of course there is lots more information that should be collectable, like
score value of each rule at the site, number of times a rule hits on ham vs
hits on spam, whether user rules are enabled, allowed_languages, maybe
country location of site, etc.  Optionally (at sysadmin option) the report
could contain contact information, so that if a site has a local rule that
seems to be doing remarkably well, someone could ask them if they would be
willing to submit it to the general corpus.  Etc.  But all of that could
come later.
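For the richer later version, the report might grow into a small structured record per site; the field names below are purely illustrative of the kinds of optional data mentioned above:

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical richer report entries (all field names are illustrative only).
@dataclass
class RuleReport:
    rule: str
    local_score: float   # score value of the rule at this site
    spam_hits: int       # times the rule hit on spam
    ham_hits: int        # times the rule hit on ham

@dataclass
class SiteReport:
    site_country: Optional[str]   # optional, at sysadmin discretion
    contact: Optional[str]        # optional contact info
    user_rules_enabled: bool
    rules: list

report = SiteReport(
    site_country="US",
    contact=None,                 # site chose not to share contact info
    user_rules_enabled=True,
    rules=[RuleReport("BAYES_99", 3.5, 210, 1)],
)
print(asdict(report))
```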

        Loren
