Hello Loren,

Tuesday, July 26, 2005, 1:29:24 AM, you wrote:

>> How would we determine ham/spam?  At this point all we have is SA's
>> first estimation, and no way of knowing whether this is accurate, FN,
>> or FP.

LW> All we could reasonably do is take SA's assessment of the
LW> message and assume that statistically it will be correct to one or
LW> two sigma or so.  If the reporting site really does have a huge
LW> percentage of FPs or FNs, it will screw the reporting up some. 
LW> But most sites should be running fairly cleanly, so we should be
LW> able to assume with about 95% accuracy that the assessment of what
LW> kind of mail the rule hit was correct.

Better would be to ensure we have an identifier in the feedback
indicating the site it came from, and to track "reliability"
percentages per site somehow.  Sites that feed SA mailing-list
traffic through SA, for instance, will often be less reliable than
sites that don't.
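
To make that concrete, here's a rough Python sketch of what I mean by
tracking per-site reliability.  The site_id field, the report format,
and the idea of comparing each site's verdict against some consensus
are my assumptions, not an agreed design:

# Hypothetical sketch of per-site reliability tracking; the site_id
# field, report format, and the "consensus" comparison are assumptions.
from collections import defaultdict

class SiteReliability:
    """Track how often each reporting site's ham/spam verdicts look trustworthy."""

    def __init__(self):
        # site_id -> [agreed_reports, total_reports]
        self.counts = defaultdict(lambda: [0, 0])

    def record(self, site_id, site_verdict, consensus_verdict):
        """Count a report; 'agreed' means the site's verdict matched the consensus."""
        agreed, total = self.counts[site_id]
        self.counts[site_id] = [agreed + (site_verdict == consensus_verdict),
                                total + 1]

    def reliability(self, site_id):
        """Fraction of this site's reports that agreed with the consensus (1.0 if unknown)."""
        agreed, total = self.counts[site_id]
        return agreed / total if total else 1.0

# Usage: down-weight or discard reports from sites whose reliability drops.
tracker = SiteReliability()
tracker.record("site-a", "spam", "spam")
tracker.record("site-a", "ham", "spam")
print(tracker.reliability("site-a"))   # 0.5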

For that matter, perhaps we should include rules that test
specifically for SA, SARE, Spam-L, SURBL, and similar anti-spam list
traffic, and flag it so we can discard those emails during analysis.
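
As a rough illustration (this is a Python sketch, not actual
SpamAssassin rule syntax, and the list-id patterns are just guesses),
the check could be as simple as:

# Sketch of flagging mail from anti-spam mailing lists so it can be
# excluded from rule-hit analysis.  The patterns below are illustrative.
import email
import re

ANTISPAM_LIST_IDS = re.compile(r"(spamassassin|sare|spam-l|surbl)",
                               re.IGNORECASE)

def is_antispam_list_mail(raw_message: bytes) -> bool:
    """Return True if the message appears to come from an anti-spam mailing list."""
    msg = email.message_from_bytes(raw_message)
    for header in ("List-Id", "List-Post", "Mailing-List"):
        if ANTISPAM_LIST_IDS.search(msg.get(header, "")):
            return True
    return False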

LW> One of the main goals here would be to look for rules that we
LW> think are supposed to hit spam, but the reporting site claims 25%
LW> of the time hit ham.  This would be a clear indication that a) the
LW> site is terribly screwed up, b) the rule is terribly screwed up if
LW> the site reports that it speaks English, c) the rule doesn't work
LW> well in whatever language the site reports it uses.

Again, I favor not worrying too much about the site's native
language, and worrying instead about each email's language.
Therefore I'd conclude (c) if the rule hits ham 20%-30% of the time
across many sites with high reliability, and (a) if only one or just
a few sites show the problem.
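
In pseudo-code terms, the triage I'm imagining looks something like
this (the 20% threshold, the reliability cutoff, and the "few sites"
count are placeholders, not proposals):

# Illustrative triage of a spam rule that reports ham hits; the exact
# numbers are assumptions, only the categories follow the discussion.
def triage_rule(ham_hit_rates, reliabilities, ham_threshold=0.20,
                reliable=0.90):
    """ham_hit_rates / reliabilities: per-site dicts keyed by site id."""
    problem_sites = [
        s for s, rate in ham_hit_rates.items()
        if rate >= ham_threshold and reliabilities.get(s, 0.0) >= reliable
    ]
    if not problem_sites:
        return "rule looks fine"
    if len(problem_sites) <= 2:
        return "(a) the reporting site(s) look screwed up"
    return "(c) the rule probably fails in the language those sites use"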

LW> Clearly the obverse situation also applies.

LW> This is why I also want the reports to contain an indication
LW> of the language and/or geographical location of the site, so we
LW> can spot the foreign language problems.  If the report also
LW> contained some indication of the percentage of mails that were
LW> submitted for learning and showed signs of being mis-classified
LW> (if that information is obtainable) it would give an indication of
LW> the reliability of the classification data from the site.

That information should be available -- sa-learn should be able to
read the X-Spam-Status header to determine the original score/status,
and to report for each learned message whether it is "confirmation"
(was spam, is learned as spam), "education" (wasn't sure, now is
sure), or "correction" (was spam, is learned as ham).

Bob Menschel

