http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5096
Summary: replace some mass-check spam corpora with spamtrap data
Product: Spamassassin
Version: SVN Trunk (Latest Devel Version)
Platform: Other
OS/Version: other
Status: NEW
Severity: normal
Priority: P5
Component: Score Generation
AssignedTo: [email protected]
ReportedBy: [EMAIL PROTECTED]
This is an issue that I want to get into BZ so I don't forget it.
I think some of our mass-check corpora are no longer receiving representative
spam feeds.
Due to spam volume, many of us no longer accept all the spam that is sent to
our MXes -- the "easy" spam is being rejected during the SMTP conversation, and
therefore never makes it into our corpus. For example, my MX is now using
SBL+XBL during the SMTP conversation, rejecting about 40% of the incoming spam
to jmason.org. It looks like other mass-checkers are doing something similar,
based on the network rule hit-rates on one corpus compared to another:
http://ruleqa.spamassassin.org/20060902-r439560-n/RCVD_IN_XBL/detail#DETAILS_all_mass_check_date_rev_20060902_r439560_n
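The kind of per-corpus comparison described above can be sketched roughly as
follows. This is only an illustration, not the ruleqa implementation: it
assumes each mass-check log line is whitespace-separated as "flag score path
RULE_A,RULE_B ...", and the function name, corpus names and sample lines are
all hypothetical.

```python
def rule_hit_rate(log_lines, rule):
    """Return the fraction of logged messages that hit `rule`.

    Assumed line layout (an assumption, not taken from the bug report):
        <flag> <score> <path> <comma-separated rule names> [metadata...]
    Blank lines and lines starting with '#' are skipped.
    """
    total = 0
    hits = 0
    for line in log_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue  # not a message line under the assumed layout
        rules = fields[3].split(",") if len(fields) > 3 else []
        total += 1
        if rule in rules:
            hits += 1
    return hits / total if total else 0.0


# Hypothetical sample data: one corpus fed by an unfiltered MX,
# one fed by an MX that rejects listed hosts at SMTP time.
corpora = {
    "unfiltered": [
        "Y 20 /spam/1 RCVD_IN_XBL,RULE_A",
        "Y 12 /spam/2 RULE_A",
        "Y 25 /spam/3 RCVD_IN_XBL,RULE_B",
        "Y 9 /spam/4 RULE_B",
    ],
    "smtp_filtered": [
        "Y 11 /spam/5 RULE_A",
        "Y 8 /spam/6 RULE_B",
        "Y 14 /spam/7 RCVD_IN_XBL",
        "Y 10 /spam/8 RULE_A,RULE_B",
    ],
}

for name, lines in corpora.items():
    print(name, rule_hit_rate(lines, "RCVD_IN_XBL"))
```

A corpus whose hit-rate for a DNSBL rule is far below the others is a
candidate for having that list applied at SMTP time.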
This is a problem, since the score generation process relies on having
a "representative" selection of spam and ham, and if half of the "easy" spam
never reaches the corpus, the selection is no longer representative.
I suggest that we should stop mass-checks of 'problematic' corpora and replace
them with (reliable, carefully vetted, bounce-filtered) spamtrap data.
I also suggest that these spamtraps be set up with some kind of limited
SpamAssassin ruleset, so that they can record "live" network rule results
on the trapped mails.
Theo noted --
> FWIW, my personal mail and my spamtraps have no filtering other than SA.
> I can create new/share some of my current spamtrap addresses if people
> want to "spread them around" more than I have (which isn't a lot).
(ps: theo, is /home/corpus/SA/corpus/ham/hamtrap/2006/09/01/8fc9d5de19
a ham? could you verify? it hits XBL in the mass-check results above)