Hello Loren,

Tuesday, July 26, 2005, 1:43:42 AM, you wrote:

>> More thought ... what if SA systems were to accumulate daily
>> statistics, along the lines of one record for each rule, containing:

LW> That sounds like the general sort of vague idea I had, fleshed out in more
LW> detail.
LW> Certainly the desirable goal is basically:

LW> 1 does this rule hit anything?
LW> 2 does it hit what it was supposed to hit?
LW> 3 does it look like a score adjustment might help, either up or down?
LW> 4 is this hitting something in a language that it wasn't intended to hit?

LW> I think to do that we need basically anonymous information,
LW> with the exception that we should know the primary site
LW> language(s) to help diagnose foreign language problems.

Rather than the primary site language(s), I'd be more interested in
the language of each email.  ContractorsWarehouse.com understands only
English, but twice in the last five years we've received non-English
ham from individuals in Europe who hoped (or maybe assumed) we
would understand their language.

"Email in Polish, spam" is more useful, I think, than "Home language
French, spam".


LW> In addition, I think the site should be able to optionally
LW> report a site contact address if they want to.  This could be
LW> useful if the stats indicate that they have a seemingly local rule
LW> that is doing really well.  There would be someone that we could
LW> write and ask if they would be willing to contribute it to the
LW> regular rules.

Good idea!

LW> Another thing that would be nice to get from sites would be
LW> rule overlap information.  I'm not sure how to accumulate this
LW> with any efficiency, nor how to report it compactly.  But with a
LW> good idea of rules hitting in the spam/ham categories, and a
LW> decent indication of rule overlap, it should be possible to
LW> generate theoretical scoring profiles that would work perhaps
LW> better than the default.

The way to get that would be to receive mass-check-like info, perhaps
a log of every email scanned, just the default (unmodified!)
X-Spam-Status line. These can then be examined for overlaps, hit
frequencies, etc.
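
As a rough illustration of what examining those lines could look like,
here is a minimal Python sketch. It assumes each status header has
already been unfolded onto a single line and follows the default 3.x
layout ("Yes|No, score=... required=... tests=A,B,C ..."); the names
parse_status and tally are made up for this example and are not part
of SpamAssassin.

```python
import re
from collections import Counter
from itertools import combinations

# Assumed layout of an unmodified, already-unfolded status line:
#   X-Spam-Status: Yes, score=12.3 required=5.0 tests=RULE_A,RULE_B autolearn=no version=3.0.4
STATUS_RE = re.compile(r"X-Spam-Status:\s*(Yes|No),.*?\btests=(\S*)", re.IGNORECASE)


def parse_status(line):
    """Return (is_spam, frozenset_of_rule_names), or None if the line does not match."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    is_spam = m.group(1).lower() == "yes"
    # "tests=none" means no rule hit; treat it as an empty set.
    rules = frozenset(r for r in m.group(2).split(",") if r and r.lower() != "none")
    return is_spam, rules


def tally(lines):
    """Count per-rule hits (split by spam/ham verdict) and pairwise rule overlaps."""
    hits = {True: Counter(), False: Counter()}  # True = spam verdict, False = ham verdict
    pairs = Counter()                           # (rule_a, rule_b) -> messages hitting both
    for line in lines:
        parsed = parse_status(line)
        if parsed is None:
            continue  # skip anything that is not a recognisable status line
        is_spam, rules = parsed
        hits[is_spam].update(rules)
        # Sort so each unordered pair is always stored under the same key.
        pairs.update(combinations(sorted(rules), 2))
    return hits, pairs
```

Note that "spam" and "ham" here are only SA's own verdicts taken from
the header, not hand-verified classifications, so the counts say how
rules fire together rather than how accurate they are. The pair table
also grows quickly with the number of rules per message, which is
part of the compact-reporting problem Loren mentions.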

Bob Menschel

