http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From [EMAIL PROTECTED]  2005-07-31 18:42 -------
SM> It's tricky getting a good corpus: ...

In addition to your reasons, a good corpus for local use (it's spam here, and
always spam here) may not be good for global use (it's not spam to users on that
other system over there). And to expand on your:
SM> There are people who [sa-learn as spam] not because they are clueless, but
SM> if they don't recognize that something comes from a subscription or just
SM> aren't sure, ...
There are also sources that confound matters -- a user can sign up with them for
one brand, and receive emails from a corporate parent with a different domain 
name.

SM> And there's Constant Contact who may have found a way around what at first
glance appears to be a good defense against spam.

SM> ... if Constant Contact really is doing that, they must be counting on
low numbers of complaints. 

Apparently they are, based on the large number of cc.com emails here that
qualify for the BSP rules. 

SM> That link I posted to Ironport's site listed the Bonded Sender fees as of
two years ago. It makes it risky for a single customer to spam. But I can see
how Constant Contact could have a business model based on getting paid by a mix
of spammers and hammers. The Bonded Sender fines are based on number of
complaints per million mails. If you want to nail them, get aggressive about
reporting the confirmed RCVD_IN_BSP_TRUSTED spam. ...

My family gets a lot more ham than spam from cc.com, and so in the past on those
rare occasions when we've gotten cc.com spam I've gone directly to them, with
satisfactory results. Given what I'm seeing now in this corpus, I'll send in the
formal complaints to BSP/Ironport, to increase cc.com's incentive to police
their customers. 
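As a back-of-the-envelope aid for thinking about the complaints-per-million
basis quoted above, here's a tiny Python sketch. The function name and all the
numbers are hypothetical, purely for illustration -- this is not Ironport's
actual fee schedule:

```python
# Hypothetical illustration of a complaints-per-million-mails metric, the
# basis on which (per the quote above) Bonded Sender fines are assessed.
# Numbers are invented; the real fee schedule is not reproduced here.

def complaints_per_million(complaints, mails_sent):
    """Return the complaint rate normalized to one million mails sent."""
    return complaints * 1_000_000 / mails_sent

# e.g. 250 complaints against a 2-million-mail campaign:
rate = complaints_per_million(250, 2_000_000)
print(rate)  # 125.0 complaints per million
```

The point of the normalization is that a sender mixing a few spammers in with
many legitimate customers can keep this rate low -- which is why reporting the
confirmed RCVD_IN_BSP_TRUSTED spam is what moves the needle.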

SM> So how do you have a clean corpus when it could contain edge cases that are
classified wrong? ...

Or, IMO more correctly: a valid and representative corpus used for scoring
/should/ contain edge cases that may or may not be classified wrong -- there's
no other way for a major ISP, which can't know what its users did or didn't
subscribe to, to manage its spam. It's important to classify them as accurately
as humanly possible, but for SA to be optimally useful it needs to be able to
make judgments about the edge cases as well, and it can only do that if we take
the risk and include them in our corpus.

SM> What is the "correct" score for such mail? If the only difference between a
piece of spam and a piece of ham is whether the recipient subscribed to it, how
do you call either one an FP or an FN for the purpose of the rule scoring
program? I don't have answers to that.

First-pass suggestion: aim to get these "edge" emails into the 2.0-4.0 score
range, so that network tests and hopefully Bayes can push them over 5.0 or under
0.0 as appropriate for the user/site.
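To make that suggestion concrete, here's a minimal Python sketch. The 5.0
threshold matches SpamAssassin's default required_score; every other number is
invented for illustration, not taken from any real ruleset:

```python
# Illustrative only: shows how an "edge" mail whose static rules land it in
# the 2.0-4.0 band can be pushed past the spam threshold, or well below zero,
# by user/site-specific evidence such as Bayes. Scores are invented.

SPAM_THRESHOLD = 5.0  # SpamAssassin's default required_score

def classify(base_score, adjustments):
    """Sum a base rule score with site/user-specific adjustments and
    compare the total against the spam threshold."""
    total = base_score + sum(adjustments)
    return total, total >= SPAM_THRESHOLD

# Edge mail: static rules alone put it mid-band, deciding nothing.
base = 3.0

# Site A: user never subscribed; Bayes has learned this sender as spam.
total_a, is_spam_a = classify(base, [+3.5])  # hypothetical BAYES_99-like score
# Site B: user did subscribe; Bayes strongly recognizes it as ham.
total_b, is_spam_b = classify(base, [-4.0])  # hypothetical BAYES_00-like score

print(total_a, is_spam_a)  # 6.5 True  -> over the threshold, spam here
print(total_b, is_spam_b)  # -1.0 False -> under 0.0, clearly ham there
```

The design point is exactly the one above: a mid-band base score deliberately
leaves the decision to evidence that only the local site has.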




