http://bugzilla.spamassassin.org/show_bug.cgi?id=4505





------- Additional Comments From [EMAIL PROTECTED]  2005-07-30 12:43 -------
> Here's an email Bob sent to sa-dev mailing list that looks like it
> was meant to be a comment here. Or if not, I think it should be in the
> record here and it is on a public list so I feel free to repost it.

Agreed. Actually, this first comment was just back to the list; the
second was to the list cc bugz, but didn't get to bugz.  I'll try to
post directly to bugz on this subject going forward.

> However, 259 is a lot less than 792 so there still is a question why
> Bob has so many Bonded sender FPs.

My first analysis was on Henry's 10% extract from the log, going
strictly against the FN/FP warning extract from that. So the numbers
were significantly smaller than from my full corpus which Justin
reviewed.

> There are 259 emails from/via constantcontact.com
> from that 10% extract
> which are treated as spam on my system, have been flagged as spam on
> my system (scores as high as 30's and 40's), have been encapsulated
> on delivery, have never been flagged by any user as not-spam, but,
> for the purposes of a world-wide mass-check, these
> constantcontact.com emails might be questionable.

> Note: Not all constantcontact.com is treated as spam here -- quite a
> few cc.com newsletters are subscribed to and seen as ham by their
> subscribers and the system. The ones I find above in the fns file are
> all from a set of eight newsletters which have regularly (almost
> always) been seen as spam, and no user has ever corrected that
> classification.

Per my later email, this is out of over 3000 constant contact emails,
split about 50/50 in my corpus. Of the 1500+ that are considered spam
here, half are considered FPs, so apparently the other half are being
flagged correctly regardless of my corpus. No problem there.

Motley fool: Sidney indicates they're ham; I can't argue with him.
Treated as spam here because a) a user intentionally flagged it as
spam into sa-learn, b) they seem to me to be spam, based on the
contents, c) I'm not familiar with that service myself, and d) I don't
have time to research all of the sources of emails which get flagged
as spam. In my corpus, 22 from this source are flagged as spam (2 via
sa-learn), 26 as ham, 40 as unclassified.

About 80% of my BSP-trusted hits, spam, ham, and apparently also not
classified, are through constant contact. Given Sidney's discovery and
comment re: constantcontact, I'm fairly convinced that /some/ of the
cc BSP-trusted emails in my corpus are spam. But I can't be absolutely
sure which (I'd be willing to put money down on about a dozen of them
that I reviewed yesterday, even after our discussions here, but given
our discussions here, only that dozen or so).

Not all of my cc emails, of course, are BSP-Trusted. Those others also
fall on all sides of the ham/spam/unclassified groupings, and while I
haven't done stats on them, it feels from a quick glance as if the
ratio is about the same.

My corpus comes mostly from an aggressive ISP system, where:

a) a lot of spam from known spam sources is dropped before SA,
b) there are a number of additional exim filters which put additional
   headers into emails for SA to analyze,
c) we have an additional Bayes analysis system outside SA which gives
   additional feedback concerning whether an email is/isn't spam,
d) we have additional custom rules that review the outputs of (b) and
   (c) in determining the SA score,
e) we use most of the not-high-risk SARE rules,
f) we have a large number of technical users very familiar with
   spam/anti-spam concerns and very able to sa-learn their own emails,
g) we have a large number of other (not so technical) users, many of
   whom use this service specifically because of its aggressive
   anti-spam stance, many of whom do actively sa-learn also, and
h) a fair number of users who do not use sa-learn at all.

Because of the aggressive stance, we do have a higher FP ratio than
many other systems. Importantly, we don't have any complaints about
that. Again, we do drop emails before they even get to SA, but those
that get to SA all get delivered to the users, with spam encapsulated.
Some FPs are corrected via sa-learn, as are many FNs.

All FPs and FNs are trapped and entered into my corpus. The number
that I then discard on review afterwards is small -- a handful each
month.

I also trap and enter those emails which are flagged as ham (negative
scores) or spam (scores over 5) by BOTH SA and one of our internal
systems. I review both of these categories, but because of the numbers
I don't manually validate each and every one. I do review the ham more
carefully than the spam.
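
As a rough sketch of that trap rule (the function name and the
"internal verdict" parameter are mine, purely illustrative; the
thresholds are the ones described above), it amounts to:

```python
def trap_for_corpus(sa_score, internal_verdict):
    """Decide whether a message enters the corpus automatically.

    sa_score:         SpamAssassin score for the message
    internal_verdict: 'ham' or 'spam' from the internal analysis system

    Returns 'ham', 'spam', or None (not trapped; left for manual
    FP/FN handling or ignored).
    """
    if sa_score < 0 and internal_verdict == "ham":
        return "ham"    # both systems agree: confident ham
    if sa_score > 5 and internal_verdict == "spam":
        return "spam"   # both systems agree: confident spam
    return None         # disagreement or borderline: not auto-trapped
```

Trapped ham then gets the more careful review; trapped spam a lighter
one, as described above.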

These practices may be where the discrepancy comes from -- my reliance
on others to manually validate ham/spam via sa-learn, my acceptance
of their determination when I do not have contradicting evidence
myself, and my acceptance with careful but not paranoid review of
automated classification when two or more classification systems
agree.

I'll be reviewing the BSP-other and HABEAS_ACCREDITED_COI spam hits
later today.

Meanwhile, though I have confidence that my corpus is reasonably
accurate, I also have no problem with it being discarded if my
methodology above is insufficient for scoring purposes.

There are two questions here. One, asked by Henry:
> Is Bob's data really noisy or is it really hard?
The other: what is the definition of "spam" as it should be applied to
scoring? Is there any room in there for end-user perception ("I didn't
ask for this"), or does it count mail as ham if the user ever at
any time opted in for any mail from the sender, even mail which does
not properly relate to the reason the user wanted the email?

Again, I have no problem with my corpus (or any subset of it) being
discarded. I'm also willing to work on improving my methodologies for
3.2's rescoring run.

Bob Menschel




