Hello Loren,

Tuesday, July 26, 2005, 1:43:42 AM, you wrote:

>> More thought ... what if SA systems were to accumulate daily
>> statistics, along the lines of one record for each rule, containing:

LW> That sounds like the general sort of vague idea I had, fleshed out
LW> in more detail.

LW> Certainly the desirable goal is basically:
LW> 1 does this rule hit anything?
LW> 2 does it hit what it was supposed to hit?
LW> 3 does it look like a score adjustment might help, either up or down?
LW> 4 is this hitting something in a language that it wasn't intended to hit?

LW> I think to do that we need basically anonymous information,
LW> with the exception that we should know the primary site
LW> language(s) to help diagnose foreign language problems.

Rather than the primary site language(s), I'd be more interested in
the email's language. ContractorsWarehouse.com understands only
English, but twice in the last five years we've received non-English
ham from individuals in Europe who hoped (or maybe assumed) we would
understand their language. "Email in Polish, spam" is more useful, I
think, than "Home language French, spam".

LW> In addition, I think the site should be able to optionally
LW> report a site contact address if they want to. This could be
LW> useful if the stats indicate that they have a seemingly local rule
LW> that is doing really well. There would be someone that we could
LW> write and ask if they would be willing to contribute it to the
LW> regular rules.

Good idea!

LW> Another thing that would be nice to get from sites would be
LW> rule overlap information. I'm not sure how to accumulate this
LW> with any efficiency, nor how to report it compactly. But with a
LW> good idea of rules hitting in the spam/ham categories, and a
LW> decent indication of rule overlap, it should be possible to
LW> generate theoretical scoring profiles that would work perhaps
LW> better than the default.

The way to get that would be to receive mass-check-like info, perhaps
a log of every email scanned, containing just the default
(unmodified!) X-Spam-Status line. These can then be examined for
overlaps, frequencies, etc.

Bob Menschel
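
P.S. To make the last point a bit more concrete, here is a minimal
sketch of how a pile of X-Spam-Status values could be tallied for
per-rule hit counts and pairwise overlaps. It is only an illustration,
not anything SA ships: parse_status and tally are names made up for
this example, the sample headers are invented, and it assumes the
default 3.x layout ("Yes|No, score=... required=... tests=A,B,C
autolearn=...") with the value passed in without the
"X-Spam-Status:" prefix.

```python
import re
from collections import Counter
from itertools import combinations

# Assumption: the rule list follows "tests=" and runs until the next
# "key=" field (e.g. autolearn=) or the end of the header value.
TESTS_RE = re.compile(r"tests=(.*?)(?=\s+\w+=|$)", re.S)


def parse_status(value):
    """Return (is_spam, sorted rule names) for one X-Spam-Status value."""
    is_spam = value.strip().lower().startswith("yes")
    match = TESTS_RE.search(value)
    if not match:
        return is_spam, []
    # Folded headers can put whitespace inside the comma-separated list.
    names = re.sub(r"\s+", "", match.group(1)).split(",")
    # Drop empties and the literal "none" placeholder; dedupe and sort so
    # that rule pairs come out in one canonical order.
    return is_spam, sorted({n for n in names if n and n != "none"})


def tally(values):
    """Count rule hits per class, and how often two rules hit together."""
    hits = {"spam": Counter(), "ham": Counter()}
    overlap = Counter()
    for value in values:
        is_spam, rules = parse_status(value)
        hits["spam" if is_spam else "ham"].update(rules)
        overlap.update(combinations(rules, 2))
    return hits, overlap


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    sample = [
        "Yes, score=12.3 required=5.0 tests=BAYES_99,HTML_MESSAGE,\n"
        "\tMIME_HTML_ONLY autolearn=spam version=3.0.4",
        "No, score=0.1 required=5.0 tests=HTML_MESSAGE autolearn=ham "
        "version=3.0.4",
    ]
    hits, overlap = tally(sample)
    print(hits["spam"].most_common())
    print(overlap.most_common())
```

Counting every pair is quadratic in the number of rules per message,
which should be tolerable since a single message rarely hits more
than a few dozen rules. Whether the pair counts can be reported
compactly is a separate question this sketch does not try to answer.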
