http://bugzilla.spamassassin.org/show_bug.cgi?id=4505
------- Additional Comments From [EMAIL PROTECTED] 2005-07-30 12:43 -------

> Here's an email Bob sent to the sa-dev mailing list that looks like it
> was meant to be a comment here. Or if not, I think it should be in the
> record here, and it is on a public list, so I feel free to repost it.

Agreed. Actually, this first comment went just back to the list; the
second went to the list cc bugz, but didn't get to bugz. I'll try to post
directly to bugz on this subject going forward.

> However, 259 is a lot less than 792, so there still is a question why
> Bob has so many Bonded Sender FPs.

My first analysis was on Henry's 10% extract from the log, going strictly
against the FN/FP warning extract from that. So the numbers were
significantly smaller than those from my full corpus, which Justin
reviewed.

> There are 259 emails from/via constantcontact.com in that 10% extract
> which are treated as spam on my system, have been flagged as spam on
> my system (scores as high as the 30's and 40's), have been encapsulated
> on delivery, and have never been flagged by any user as not-spam, but,
> for the purposes of a world-wide mass-check, these
> constantcontact.com emails might be questionable.
>
> Note: Not all constantcontact.com is treated as spam here -- quite a
> few cc.com newsletters are subscribed to and seen as ham by their
> subscribers and the system. The ones I find above in the fns file are
> all from a set of eight newsletters which have regularly (almost
> always) been seen as spam, and no user has ever corrected that
> classification.

Per my later email, this is out of over 3000 Constant Contact emails,
split about 50/50 in my corpus. Of the 1500+ that are considered spam
here, half are considered FPs, so apparently the other half are being
flagged correctly regardless of my corpus. No problem there.

Motley Fool: Sidney indicates they're ham; I can't argue with him. They
are treated as spam here because a) a user intentionally flagged them as
spam via sa-learn, b) they seem to me to be spam, based on the contents,
c) I'm not familiar with that service myself, and d) I don't have time to
research all of the sources of emails which get flagged as spam. In my
corpus, 22 emails from this source are flagged as spam (2 via sa-learn),
26 as ham, and 40 as unclassified.

About 80% of my BSP-trusted hits -- spam, ham, and apparently also the
unclassified -- come through Constant Contact. Given Sidney's discovery
and comment re: constantcontact, I'm fairly convinced that /some/ of the
cc BSP-trusted emails in my corpus are spam. But I can't be absolutely
sure which (I'd be willing to put money down on about a dozen of them
that I reviewed yesterday, even after our discussions here -- but given
our discussions here, only that dozen or so).

Not all of my cc emails, of course, are BSP-trusted. The others also fall
on all sides of the ham/spam/unclassified groupings, and while I haven't
done stats on them, a quick glance suggests the ratio is about the same.
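
(For context on the sa-learn references above and below: when a user
reclassifies a message, it is fed back into the Bayes database with the
stock sa-learn tool, roughly as follows. The paths here are illustrative
only, not our actual setup.)

    # Teach Bayes that a delivered message was actually spam
    # (e.g., the Motley Fool mail a user flagged as spam):
    sa-learn --spam /path/to/message

    # Correct a false positive by teaching the message as ham:
    sa-learn --ham /path/to/message

    # Whole mailboxes can be fed at once in mbox format:
    sa-learn --spam --mbox /path/to/junk.mbox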
My corpus comes mostly from an aggressive ISP system, where:

a) a lot of spam from known spam sources is dropped before SA,
b) a number of additional exim filters put extra headers into emails for
   SA to analyze,
c) we have an additional Bayes analysis system outside SA which gives
   additional feedback on whether an email is or isn't spam,
d) we have additional custom rules that review the outputs of (b) and
   (c) in determining the SA score (a sketch of such a rule appears at
   the end of this message),
e) we use most of the not-high-risk SARE rules,
f) we have a large number of technical users, very familiar with
   spam/anti-spam concerns and very able to sa-learn their own emails,
g) we have a large number of other (not so technical) users, many of
   whom use this service specifically because of its aggressive
   anti-spam stance, and many of whom also actively sa-learn, and
h) we have a fair number of users who do no sa-learning at all.

Because of the aggressive stance, we do have a higher FP ratio than many
other systems. Importantly, we don't get any complaints about that.
Again, we do drop some emails before they even reach SA, but those that
do reach SA are all delivered to the users, with spam encapsulated.

Some FPs are corrected via sa-learn, as are many FNs. All FPs and FNs
are trapped and entered into my corpus. The number I then discard on
review afterwards is small -- a handful each month. I also trap and
enter those emails which are flagged as ham (negative scores) or as spam
(scores over 5) by BOTH SA and one of our internal systems. I review
both of these categories, but because of the numbers I don't manually
validate each and every one. I do review the ham more carefully than the
spam.

These practices may be where the discrepancy comes from: my reliance on
others to manually validate ham/spam via sa-learn, my acceptance of
their determination when I have no contradicting evidence myself, and my
acceptance, after careful but not paranoid review, of automated
classification when two or more classification systems agree.

I'll be reviewing the BSP-other and HABEAS_ACCREDITED_COI spam hits
later today. Meanwhile, though I am confident that my corpus is
reasonably accurate, I have no problem with it being discarded if my
methodology above is insufficient for scoring purposes.

Two questions remain. One, asked by Henry:

> Is Bob's data really noisy or is it really hard?

And the other: what is the definition of "spam" as it should be applied
to scoring? Is there any room in there for end-user perception ("I
didn't ask for this"), or does mail count as ham if the user ever, at
any time, opted in to any mail from the sender, even mail which does not
properly relate to the reason the user wanted it?

Again, I have no problem with my corpus (or any subset of it) being
discarded. I'm also willing to work on improving my methodologies for
3.2's rescoring run.
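
(For illustration of point (d) above: a local rule keyed on a header
added by an upstream exim filter might look like the sketch below. The
header name, rule names, and scores are hypothetical, not our actual
configuration.)

    # Hypothetical: an exim filter has added "X-Ext-Bayes: spam" or
    # "X-Ext-Bayes: ham" based on the external Bayes system's verdict,
    # and SA folds that verdict into its own score.
    header   LOCAL_EXT_BAYES_SPAM  X-Ext-Bayes =~ /^spam/i
    describe LOCAL_EXT_BAYES_SPAM  External Bayes system judged this message spam
    score    LOCAL_EXT_BAYES_SPAM  2.5

    header   LOCAL_EXT_BAYES_HAM   X-Ext-Bayes =~ /^ham/i
    describe LOCAL_EXT_BAYES_HAM   External Bayes system judged this message ham
    score    LOCAL_EXT_BAYES_HAM   -1.5

Bob Menschel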
